SDS 2024 - Cleantech RAG

By Daniel Perruchoud and George Rowlands

Introduction

This notebook delves into the exciting realm of cleantech using a dataset of nearly 10,000 news articles from Kaggle, all centered on the energy sector. We'll embark on a journey that includes data exploration and text preprocessing, and culminates in the creation of a Retrieval-Augmented Generation (RAG) pipeline. This powerful approach lets us build an LLM-powered (Large Language Model) system that can intelligently answer user queries, drawing on the knowledge in our curated news articles.

Why RAG? A Cost-Effective and Dynamic Solution

Fine-tuning an LLM can be resource-intensive and inflexible. RAG offers a compelling alternative: it uses semantic search to pinpoint the sections of our news articles most relevant to a user's question. These retrieved sections are then provided to the LLM as context, enabling it to deliver informed and insightful responses.

(figure: RAG pipeline diagram)
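The retrieve-then-generate flow described above can be sketched in a few lines. This is a minimal illustration with hypothetical names and toy stand-ins for the retriever and the LLM, not the pipeline built later in this notebook:

```python
from typing import Callable, List

def answer(question: str,
           retrieve: Callable[[str], List[str]],
           generate: Callable[[str], str]) -> str:
    """Retrieve-then-generate: fetch relevant chunks, then ask the LLM with them as context."""
    chunks = retrieve(question)                 # semantic search over article chunks
    context = "\n\n".join(chunks)               # stuff retrieved chunks into the prompt
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return generate(prompt)

# Toy stand-ins so the flow can be run without any model or vector store
store = {"solar": "Solar capacity grew strongly in 2023."}
retrieve = lambda q: [v for k, v in store.items() if k in q.lower()]
generate = lambda p: p.splitlines()[-1]         # echoes the final prompt line

print(answer("What happened to solar?", retrieve, generate))
```

In the real pipeline, `retrieve` becomes a vector-store similarity search and `generate` an LLM call, but the shape of the data flow stays the same.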

Setup

To run this notebook, we recommend downloading the provided GitHub repository and opening the notebook in Google Colab. To ensure a smooth experience, keep the following in mind:

At the start of the notebook, a data.zip file is downloaded from Google Drive and unzipped. It contains checkpoints for all of the expensive processing steps, such as chunking, generating embeddings, and evaluating the pipeline with an LLM as a judge. This saves you money and a lot of time.

If you can't or don't want to run this notebook, you can also view the completed notebook by opening the cleantech_rag.html file in your browser.

Unveiling the Depths of RAG Pipelines

Throughout this notebook, we'll delve into the intricate workings of RAG pipelines, from chunking and embedding to retrieval and evaluation.

Questions or Issues? We're Here to Help!

If you encounter any roadblocks or have questions, please don't hesitate to reach out to George Rowlands.

Installing Dependencies

%%writefile requirements.txt

chromadb==0.5.0
datasets==2.19.1
gdown==5.2.0
kaggle==1.6.1
langchain==0.2.0
langchain-community==0.2.0
langchain-experimental==0.0.59
langchain-openai==0.1.7
langdetect==1.0.9
lorem-text==2.1
nbformat>=4.2.0
plotly==5.22.0
pretty-jupyter==1.0
python-dotenv==1.0.1
ragas==0.1.8
seaborn==0.13.2
sentence-transformers==3.0.0
spacy>=3.7
textstat==0.7.3
umap-learn==0.5.5
Overwriting requirements.txt
%pip install torch==2.3.0 --quiet --index-url https://download.pytorch.org/whl/cu121
Note: you may need to restart the kernel to use updated packages.
%pip install -r ./requirements.txt --quiet
Note: you may need to restart the kernel to use updated packages.
import json
import os
import warnings
import zipfile
from collections import Counter
from pathlib import Path
from typing import Dict, List

import chromadb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import torch
from chromadb import Collection, Documents, EmbeddingFunction, Embeddings
from datasets import Dataset
from dotenv import load_dotenv
from langdetect import detect
from lorem_text import lorem
from ragas import RunConfig, evaluate
from ragas.metrics import (faithfulness, answer_relevancy, context_relevancy, answer_correctness)
from spacy.lang.en import English
from textstat import flesch_reading_ease
from tqdm import tqdm
import umap

from langchain.chains.base import Chain
from langchain.text_splitter import RecursiveCharacterTextSplitter, TextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma, VectorStore
from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from langchain_core.embeddings import Embeddings
from langchain_core.language_models import LLM
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.retrievers import BaseRetriever
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

load_dotenv()
warnings.filterwarnings("ignore")
!gdown 1MoT_s_Zk4dzRRy7E7Va5ZuTROIOI1FfZ
with zipfile.ZipFile("data.zip", "r") as zip_file:
    zip_file.extractall()

Setting your OpenAI Key

This OpenAI key is used for generating responses with ChatOpenAI, creating OpenAI embeddings, and evaluating the pipeline with ragas.

To set it, rename the .env-example file to .env and add the key in the provided slot.

openai_key = "sk-XXXXXXXXXXXXXXXX"

Setting up our LLM

To make sure our OpenAI key works, we'll test it by generating a response with ChatOpenAI, the same interface we'll later use in our RAG pipeline. Try some different prompts or questions to see how the model responds.

llm = ChatOpenAI(model="gpt-3.5-turbo")
question_prompt = ChatPromptTemplate.from_template(
    "Answer the following question: {question}")
question_chain = question_prompt | llm | StrOutputParser()
question_chain.invoke({"question": "What is the meaning of life?"})
'The meaning of life is a complex and subjective concept that varies from person to person. Some may believe that the meaning of life is to seek happiness and fulfillment, others may see it as a journey of self-discovery and personal growth, while others may find meaning in their relationships with others or in their contributions to society. Ultimately, the meaning of life is a deeply personal and individual question that each person must explore and define for themselves.'

Downloading the Dataset from Kaggle

We will be exploring the following Cleantech Media Dataset. If you opened this notebook as recommended, via the provided GitHub repository in Google Colab, then you don't need to download the dataset; it should already be under data/bronze. If not, you can either manually download it and upload it into a data/bronze folder or follow the steps below.

Using the Kaggle API

We will be using the Kaggle API to download the data.

To use the Kaggle API you will need a Kaggle account. If you don't already have one, sign up for a Kaggle account at https://www.kaggle.com. When you are logged in, go to the 'Settings' tab of your user profile https://www.kaggle.com/settings and select 'Create New Token'. This will trigger the download of kaggle.json, a file containing your API credentials.

You can then add your Kaggle username and key from the kaggle.json file to the .env file just like with the OpenAI Key.
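The kaggle package also accepts credentials through the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables, which `load_dotenv()` populates from your .env file. A minimal sketch (the values below are placeholders, not real credentials):

```python
import os

# The kaggle CLI reads these environment variables when no kaggle.json is
# present; load_dotenv() fills them in from .env (placeholder values here).
os.environ.setdefault("KAGGLE_USERNAME", "your_kaggle_username")
os.environ.setdefault("KAGGLE_KEY", "your_kaggle_api_key")
print(os.environ["KAGGLE_USERNAME"])
```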

data_folder = Path("./data")
if not data_folder.exists():
    data_folder.mkdir()
bronze_folder = data_folder / "bronze"
if not bronze_folder.exists():
    bronze_folder.mkdir()
%%script echo skipping
kaggle_user = "XXXXXXXXXXXXXXXX"
kaggle_key = "XXXXXXXXXXXXXXXX"
skipping
%%script echo skipping
os.system(f"kaggle datasets download -d jannalipenkova/cleantech-media-dataset -p {bronze_folder}")
skipping
%%script echo skipping
with zipfile.ZipFile(bronze_folder / "cleantech-media-dataset.zip", "r") as zip_file:
    zip_file.extractall(bronze_folder)
skipping

Loading the Dataset into DataFrames

articles_df = pd.read_csv(
    bronze_folder / "cleantech_media_dataset_v2_2024-02-23.csv",
    encoding='utf-8', index_col=0)
articles_df.head()
title date author content domain url
1280 Qatar to Slash Emissions as LNG Expansion Adva... 2021-01-13 NaN ["Qatar Petroleum ( QP) is targeting aggressiv... energyintel https://www.energyintel.com/0000017b-a7dc-de4c...
1281 India Launches Its First 700 MW PHWR 2021-01-15 NaN ["• Nuclear Power Corp. of India Ltd. ( NPCIL)... energyintel https://www.energyintel.com/0000017b-a7dc-de4c...
1283 New Chapter for US-China Energy Trade 2021-01-20 NaN ["New US President Joe Biden took office this ... energyintel https://www.energyintel.com/0000017b-a7dc-de4c...
1284 Japan: Slow Restarts Cast Doubt on 2030 Energy... 2021-01-22 NaN ["The slow pace of Japanese reactor restarts c... energyintel https://www.energyintel.com/0000017b-a7dc-de4c...
1285 NYC Pension Funds to Divest Fossil Fuel Shares 2021-01-25 NaN ["Two of New York City's largest pension funds... energyintel https://www.energyintel.com/0000017b-a7dc-de4c...
human_eval_df = pd.read_csv(
    bronze_folder / "cleantech_rag_evaluation_data_2024-02-23.csv",
    encoding='utf-8', index_col=0)
human_eval_df.head()
question_id question relevant_chunk article_url
example_id
1 1 What is the innovation behind Leclanché's new ... Leclanché said it has developed an environment... https://www.sgvoice.net/strategy/technology/23...
2 2 What is the EU’s Green Deal Industrial Plan? The Green Deal Industrial Plan is a bid by the... https://www.sgvoice.net/policy/25396/eu-seeks-...
3 2 What is the EU’s Green Deal Industrial Plan? The European counterpart to the US Inflation R... https://www.pv-magazine.com/2023/02/02/europea...
4 3 What are the four focus areas of the EU's Gree... The new plan is fundamentally focused on four ... https://www.sgvoice.net/policy/25396/eu-seeks-...
5 4 When did the cooperation between GM and Honda ... What caught our eye was a new hookup between G... https://cleantechnica.com/2023/05/08/general-m...

Exploratory Data Analysis & Preprocessing

As the saying goes, "garbage in, garbage out." In the realm of machine learning, the quality of our outputs hinges on the quality of our inputs. This section delves into the essential processes of Exploratory Data Analysis (EDA) and data preprocessing. Through EDA, we'll illuminate the characteristics, patterns, and potential quirks residing within our cleantech news article dataset. Preprocessing will ensure our data is cleansed, structured, and prepared to be effectively utilized by the RAG pipeline, laying the foundation for high-quality results.

Let us start by gaining an overview of the dataset's features (columns).

articles_df.describe()
title date author content domain url
count 9593 9593 31 9593 9593 9593
unique 9569 967 7 9588 19 9593
top Cleantech Thought Leaders Series 2023-05-04 Michael Holder ['Geopolitics as much as price or quality will... cleantechnica https://www.energyintel.com/0000017b-a7dc-de4c...
freq 5 427 8 2 1861 1
articles_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 9593 entries, 1280 to 81816
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    9593 non-null   object
 1   date     9593 non-null   object
 2   author   31 non-null     object
 3   content  9593 non-null   object
 4   domain   9593 non-null   object
 5   url      9593 non-null   object
dtypes: object(6)
memory usage: 524.6+ KB

Our initial exploration reveals that the "author" column only contains data for 31 out of 9593 articles. Since this offers minimal information gain, we can remove this feature.

We've also observed that some titles and content entries appear to be non-unique. This might necessitate identifying and removing duplicate entries.

On a positive note, the article URLs are all unique, potentially serving as suitable unique identifiers for the data.
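Near-empty columns like "author" can be flagged systematically by their null ratio. A small illustration on made-up data (not the real dataset):

```python
import pandas as pd

# Toy illustration: a column that is mostly null carries little signal and
# can be flagged by its per-column null ratio.
toy = pd.DataFrame({"author": [None] * 9 + ["Jane"], "title": list("abcdefghij")})
null_ratio = toy.isna().mean()
print(null_ratio["author"])  # → 0.9
```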

articles_df = articles_df.drop(columns=["author"])

Article Domains

The dataset helpfully provides the domain names extracted from the article URLs. These domains essentially represent the publishers of the news articles. Let's analyze the distribution of publishers and see how many articles each publisher has contributed.

domain_counts = articles_df["domain"].value_counts()
domain_counts
domain
cleantechnica            1861
azocleantech             1627
pv-magazine              1206
energyvoice              1017
solarindustrymag          673
naturalgasintel           658
thinkgeoenergy            645
rechargenews              559
solarpowerworldonline     505
energyintel               234
pv-tech                   232
businessgreen             158
greenprophet               80
ecofriend                  38
solarpowerportal.co        34
eurosolar                  28
decarbxpo                  19
solarquarter               17
indorenergy                 2
Name: count, dtype: int64
barplot = sns.barplot(
    x=domain_counts.values, 
    y=domain_counts.index,
    hue=domain_counts.index
)

barplot.set_title('Article Counts by Domain')
barplot.set_xlabel('Article Count')
barplot.set_ylabel('Domain')

plt.show()
No description has been provided for this image

Our exploration of article domains reveals a skewed distribution. Publishers like cleantechnica have a significantly higher representation (1,861 articles), while others like indorenergy have minimal contributions (2 articles). If we proceed with sampling this data, this imbalance should be taken into account. Stratified sampling could be a viable approach to ensure a representative sample across different publishers.
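Stratified sampling is straightforward with a grouped `sample` in pandas. A sketch on toy data (domain names borrowed from the counts above; the fractions are illustrative):

```python
import pandas as pd

# Illustrative stratified sample: take ~10% of articles per domain so that
# small publishers remain proportionally represented.
toy = pd.DataFrame({"domain": ["cleantechnica"] * 90 + ["indorenergy"] * 10})
sample = toy.groupby("domain", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=0))
print(sample["domain"].value_counts())
```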

Article Dates

Each article within the dataset is accompanied by a publication date. Let's delve into the temporal range of these articles and investigate any noteworthy patterns in publication trends.

# plot the amount of articles over time
articles_df["date"] = pd.to_datetime(articles_df["date"])
time_df = articles_df.groupby("date").size().reset_index()
time_df.columns = ["date","count"]

time_df.describe()
date count
count 967 967.000000
mean 2022-06-01 19:11:06.390899456 9.920372
min 2021-01-01 00:00:00 1.000000
25% 2021-09-11 12:00:00 4.000000
50% 2022-06-06 00:00:00 9.000000
75% 2023-02-14 12:00:00 13.000000
max 2023-12-05 00:00:00 427.000000
std NaN 15.206340
sns.lineplot(data=time_df, x="date", y="count")
plt.title("Article Count Over Time")
plt.xlabel("Date")
plt.xticks(rotation=90)
plt.ylabel("Article Count")
# add a line for the average
avg_count = time_df["count"].mean()
plt.axhline(avg_count, color='r', linestyle='--', label=f"Average article count per day: {avg_count:.2f}")
plt.legend()
plt.show()
No description has been provided for this image

While the daily article count appears consistent overall, a significant outlier disrupts the pattern on 2023-12-05. The cause of this outlier is undetermined, but it could be the date the data was scraped, used as the default value for missing dates. Since the publication date is not crucial for the RAG pipeline, we can remove it.
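Such outlier days can also be flagged programmatically. One simple median-based rule, shown here on toy counts (the real outlier in our data is 427 articles on 2023-12-05):

```python
import pandas as pd

# Flag days whose article count dwarfs the typical day (toy counts).
counts = pd.Series({"2023-12-04": 9, "2023-12-05": 427, "2023-12-06": 8})
outliers = counts[counts > 10 * counts.median()]
print(outliers.index.tolist())  # → ['2023-12-05']
```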

articles_df = articles_df.drop(columns=["date"])

Article Titles

As noted in our initial exploration, some articles share identical titles. Here, we'll focus on identifying and handling these duplicate titles to ensure a clean and consistent dataset for our RAG pipeline.

sns.histplot(articles_df["title"].str.len())
plt.title("Title Length Distribution")
plt.xlabel("Title Length")
plt.ylabel("Count")
avg_count = articles_df["title"].str.len().mean()
plt.axvline(avg_count, color='r', linestyle='--', label=f"Average title length: {avg_count:.2f}")
plt.legend()
plt.show()
No description has been provided for this image
articles_df["title"].duplicated().sum()
24
duplicate_titles = articles_df[articles_df["title"].duplicated(keep=False)].sort_values("title")
duplicate_titles.head(10)
title content domain url
6654 Aberdeen’ s NZTC plans national centre for geo... ['Aberdeen’ s NZTC is planning a national cent... energyvoice https://www.energyvoice.com/renewables-energy-...
6660 Aberdeen’ s NZTC plans national centre for geo... ['Aberdeen’ s NZTC is planning a national cent... energyvoice https://sgvoice.energyvoice.com/strategy/techn...
38593 About David J. Cross ["By clicking `` Allow All '' you agree to the... azocleantech https://www.azocleantech.com/authors/david-cross
38599 About David J. Cross ["By clicking `` Allow All '' you agree to the... azocleantech https://www.azocleantech.com/authors/david-cro...
38596 About David J. Cross ["By clicking `` Allow All '' you agree to the... azocleantech https://www.azocleantech.com/authors/david-cro...
38598 About David J. Cross ["By clicking `` Allow All '' you agree to the... azocleantech https://www.azocleantech.com/authors/david-cro...
38597 About David J. Cross ["By clicking `` Allow All '' you agree to the... azocleantech https://www.azocleantech.com/authors/david-cro...
6704 BEIS mulls ringfenced CfD support for geotherm... ['Ministers are considering whether geothermal... energyvoice https://sgvoice.energyvoice.com/policy/21121/b...
6702 BEIS mulls ringfenced CfD support for geotherm... ['Ministers are considering whether geothermal... energyvoice https://www.energyvoice.com/renewables-energy-...
37040 Cleantech Insights from Industry Series ["By clicking `` Allow All '' you agree to the... azocleantech https://www.azocleantech.com/Insights.aspx?page=2
duplicate_titles["content"].duplicated().sum()
0

Our exploration identified 24 titles that appear multiple times in the dataset. Examples include "About David J. Cross." Interestingly, while the titles are identical, the content itself appears to be unique.

Let's take a closer look at a pair of these duplicates.
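We first define a small recursive helper to wrap long article strings for printing. As an aside, the standard library's textwrap module provides similar greedy wrapping (shown here as an alternative, not a drop-in replacement for the helper below):

```python
import textwrap

# textwrap.fill breaks greedily at spaces so no line exceeds the width.
print(textwrap.fill("one two three four five", width=10))
```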

def wrap_text(text: str, char_per_line=100) -> str:
    # For readability, wrap the text at the last space before char_per_line.
    if len(text) < char_per_line:
        return text
    head = text[:char_per_line].rsplit(' ', 1)[0]
    return head + '\n' + wrap_text(text[len(head) + 1:], char_per_line)
print(duplicate_titles.iloc[0]["title"])
print(wrap_text(duplicate_titles.iloc[0]["content"]))
Aberdeen’ s NZTC plans national centre for geothermal energy
['Aberdeen’ s NZTC is planning a national centre to accelerate geothermal energy in the UK and
become the “ go-to ” hub globally for the renewable technology.', 'Calum Watson, senior project
engineer at the Net Zero Technology Centre, has outlined ambitions for the oil and gas industry to
help ramp up the clean energy, both onshore and in the North Sea.', 'The NZTC’ s new “ National
Geothermal Innovation Centre ” would develop technology and help create “ bespoke regulation ” for
geothermal, with the aim of it providing 5% of UK energy needs by 2030.', 'By 2050, Mr Watson said
geothermal could account for 20% of Britain’ s energy mix, slashing carbon emissions in the
process.', 'Geothermal is a burgeoning technology – which has been picked up in some countries like
Iceland and the Philippines – which harnesses heat in the subsurface of the earth to generate
electricity.', 'Some barriers to its uptake include expensive up-front costs like exploration and
drilling.', 'However a report published this week by trade body Offshore Energies UK said there are
2,100 offshore oil and gas wells to be decommissioned in the North Sea next decade – which Mr
Watson described as a “ massive opportunity ” for geothermal', 'Based at a “ north-east location ”,
the new hub would be the “ go to centre globally for geothermal technology challenges but,
crucially, would be world-leading in supporting government, and creating legislation and best
practice for geothermal ”.', 'Speaking at the Offshore Decommissioning Conference in St Andrews on
Tuesday, Mr Watson did not disclose whether the plan had backers or when it might be set up.', 'He
said it would be achieved through a “ partner-led roadmap ” akin to the NZTC itself – which is
funded with £180m of UK and Scottish Government funding – and ultimately be powered by geothermal
energy.', 'The national base would comprise a “ solution centre ” to scale up technologies from
pilot stage.', 'It would also have a knowledge hub to share learnings and an “ accelerator
programme ” to fund start-ups.', 'The NZTC has already dipped its toe into the tech – supporting a
“ first of its kind ” test project for the EnQuest Magnus platform in the North Sea.', 'Mr Watson
set out his hopes for what the centre could achieve by 2030, and highlighted the opportunity for
oil and gas workers to transfer to the sustainable technology.', '“ ( By 2030) we want the centre
to have delivered geothermal energy, accounting for 5% of the UK’ s energy mix and on route for 20%
by 2050.', '“ We would have multiple demonstrators successfully delivered to showcase and educate
and, long term, the center will be run on geothermal energy.']
print(duplicate_titles.iloc[1]["title"])
print(wrap_text(duplicate_titles.iloc[1]["content"]))
Aberdeen’ s NZTC plans national centre for geothermal energy
['Aberdeen’ s NZTC is planning a national centre to accelerate geothermal energy in the UK and
become the “ go-to ” hub globally for the renewable technology.', 'Calum Watson, senior project
engineer at the Net Zero Technology Centre, has outlined ambitions for the oil and gas industry to
help ramp up the clean energy, both onshore and in the North Sea.', 'The NZTC’ s new “ National
Geothermal Innovation Centre ” would develop technology and help create “ bespoke regulation ” for
geothermal, with the aim of it providing 5% of UK energy needs by 2030.', 'By 2050, Mr Watson said
geothermal could account for 20% of Britain’ s energy mix, slashing carbon emissions in the
process.', 'Geothermal is a burgeoning technology – which has been picked up in some countries like
Iceland and the Philippines – which harnesses heat in the subsurface of the earth to generate
electricity.', 'Some barriers to its uptake include expensive up-front costs like exploration and
drilling.', 'However a report published this week by trade body Offshore Energies UK said there are
2,100 offshore oil and gas wells to be decommissioned in the North Sea next decade – which Mr
Watson described as a “ massive opportunity ” for geothermal', 'Based at a “ north-east location ”,
the new hub would be the “ go to centre globally for geothermal technology challenges but,
crucially, would be world-leading in supporting government, and creating legislation and best
practice for geothermal ”.', 'Speaking at the Offshore Decommissioning Conference in St Andrews on
Tuesday, Mr Watson did not disclose whether the plan had backers or when it might be set up.', 'He
said it would be achieved through a “ partner-led roadmap ” akin to the NZTC itself – which is
funded with £180m of UK and Scottish Government funding – and ultimately be powered by geothermal
energy.', 'The national base would comprise a “ solution centre ” to scale up technologies from
pilot stage.', 'It would also have a knowledge hub to share learnings and an “ accelerator
programme ” to fund start-ups.', 'The NZTC has already dipped its toe into the tech – supporting a
“ first of its kind ” test project for the EnQuest Magnus platform in the North Sea.', 'Mr Watson
set out his hopes for what the centre could achieve by 2030, and highlighted the opportunity for
oil and gas workers to transfer to the sustainable technology.', '“ ( By 2030) we want the centre
to have delivered geothermal energy, accounting for 5% of the UK’ s energy mix and on route for 20%
by 2050.']

Our analysis suggests potential redundancy within certain articles. In some cases, one copy of an article appears to be the other with an additional sentence appended at the end.

Let's take a closer look at these "energyvoice" articles, examine how their contents start, and see if we can eliminate these redundancies.

energyvoice_articles = articles_df[articles_df["domain"].str.contains("energyvoice")]
energyvoice_articles.content.map(lambda x: x[:50]).value_counts()
content
['', '', 'The Megawatt Hour is the latest podcast     6
['A group of trade associations from across the en    3
['Two years after the Amazon Pledge Fund invested     3
['The latest analysis shows that capital flows tow    2
['Macquarie Group is betting the North Sea – engin    2
                                                     ..
['Now more than ever – in terms of cost and the im    1
['Scientists have hailed a helium discovery which     1
['Marine equipment fabrication and rental speciali    1
['The Russian powers behind oil explorers Exillon     1
['Aberdeen-headquartered Repsol Sinopec Resources     1
Name: count, Length: 980, dtype: int64
def remove_prefix_articles(df: pd.DataFrame, prefix_len: int = 100) -> pd.DataFrame:
    """
    Takes O(n^2) time complexity
    If the first {prefix_len} characters of the article are the same, then we consider them as a prefix. 
    If an article is a prefix of a longer article, then we remove it.
    If an article is a prefix of longer article, but they have different titles, then we keep them.
    """

    df["char_len"] = df["content"].map(len)
    df = df.sort_values(by='char_len', ascending=True).reset_index(drop=True)

    # Initialize a list to keep the articles that are not prefixes of others
    non_prefix_articles = []

    for i, row in df.iterrows():
        is_prefix = False
        content_i = row['content'][:prefix_len]
        title_i = row['title']

        for j in range(i + 1, len(df)):
            content_j = df.at[j, 'content'][:prefix_len]
            title_j = df.at[j, 'title']

            if content_i == content_j:
                # If the prefix matches but the titles are different, we keep it
                if title_i != title_j:
                    continue
                else:
                    is_prefix = True
                    break

        if not is_prefix:
            non_prefix_articles.append(row)

    print(f"Removed {len(df) - len(non_prefix_articles)} prefix articles")
    return pd.DataFrame(non_prefix_articles)
energyvoice_articles = remove_prefix_articles(energyvoice_articles)
energyvoice_articles.content.map(lambda x: x[:100]).value_counts()
Removed 11 prefix articles
content
['', '', 'The Megawatt Hour is the latest podcast boxset brought to you by Energy Voice Out Loud in     6
['Two years after the Amazon Pledge Fund invested in Hippo Harvest, the company is selling its first    3
['A group of trade associations from across the energy sector have written to the Chancellor urging     3
['Global Port Services has confirmed the award of multiple contracts in support of the Seagreen wind    2
['DNV report shows Jotun’ s Baltoflake solution offers beyond 30 years’ protection for offshore asse    2
                                                                                                       ..
['The deal volume for renewable energy assets in Asia more than tripled to $ 13.6 billion in 2021, a    1
['Several young energy professionals have undertaken a voyage across Scotland to spotlight the count    1
['A UK-backed research group unveiled a design for a liquid hydrogen-powered airliner theoretically     1
['UK-listed Pharos Energy is excited about its upcoming Vietnam activities with a 3D seismic shoot l    1
['With the greatest and most urgent energy transition in human history accelerating, the quest for n    1
Name: count, Length: 981, dtype: int64

There still seems to be some redundancy, but we did manage to remove 11 duplicates.
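For the remaining near-duplicates, one option (an illustrative technique, not the approach taken in this notebook) is to compare whole article texts with a similarity measure such as difflib's SequenceMatcher, and flag pairs above a threshold:

```python
from difflib import SequenceMatcher

# Two made-up near-duplicate snippets: b is a with an extra sentence appended.
a = "Qatar is targeting aggressive emissions cuts as its LNG expansion advances."
b = "Qatar is targeting aggressive emissions cuts as its LNG expansion advances. More detail follows."
similarity = SequenceMatcher(None, a, b).ratio()
print(similarity > 0.8)  # → True
```

Note that pairwise comparison is O(n^2) like the prefix approach, so for a large corpus you would typically restrict comparisons to candidate pairs first.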

Article Contents

Having explored various aspects of our dataset, we now turn our attention to the heart of the matter: the article content itself. This section will delve into the analysis and preprocessing techniques we'll employ to ensure the content is high-quality and effectively utilized by our RAG pipeline.

np.random.seed(7)
random_sample_id = np.random.choice(articles_df.index)
print(wrap_text(articles_df.loc[random_sample_id, "content"]))
['Enphase Energy Inc., a supplier of microinverter-based solar and battery systems, says its
partner Lumio will be significantly expanding its offering of Enphase IQ8 microinverters and IQ
batteries to customers across the United States.', 'The strategic relationship with Lumio will
amplify the impact and distribution of Enphase systems, providing homeowners more access to
reliable, sustainable and grid-independent power sources, the company says.', '“ We are excited
about Enphase’ s full suite of products – including microinverters, batteries and EV chargers –
that can provide our customers best-in-class home energy management solutions, ” says Greg
Butterfield, CEO at Lumio. “ Additionally, the Enphase digital platform, from lead generation to
permitting to ongoing operations and maintenance services, offers a unique ability for Lumio to
increase efficiencies and reduce costs. ”', 'For homeowners who want battery backup, there are no
sizing restrictions on pairing Enphase IQ batteries with IQ8 microinverters, and the Sunlight Jump
Start feature can restart a home energy system – switching to sunlight-only after prolonged grid
outages that may result in a fully depleted battery. This eliminates the need for a manual restart
of the system and gives homeowners greater assurance of energy resilience.', '“ This strategic
relationship with Enphase makes it easier for Lumio’ s customers to take control of their power
production, power consumption, and increase the security and reliability of their family’ s power
supply, ” adds David Schonberg, senior vice president of energy partnerships at Lumio.', 'Solar
Industry offers industry participants probing, comprehensive assessments of the technology, tools
and trends that are driving this dynamic energy sector. From raw materials straight through to
end-user applications, we capture and analyze the critical details that help professionals stay
current and navigate the solar market.', '© Copyright Zackin Publications Inc. All Rights
Reserved.']

Our initial examination reveals that article content is currently stored as a list of strings. To gain a deeper understanding and facilitate preprocessing, we'll join these lists into a more cohesive textual format.
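Since the content column holds a Python list literal serialized as a string, it can be parsed with `ast.literal_eval`, which, unlike `eval`, accepts only literals and so cannot execute arbitrary code. A small self-contained example:

```python
import ast

# A stringified list, as found in the content column (made-up example).
raw = "['First sentence.', 'Second sentence.']"
article = " ".join(ast.literal_eval(raw))  # parse safely, then join into one text
print(article)  # → First sentence. Second sentence.
```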

import ast

# ast.literal_eval parses the stringified list safely (unlike eval)
articles_df['article'] = articles_df['content'].apply(lambda x: ' '.join(ast.literal_eval(x)))
print(wrap_text(articles_df.loc[random_sample_id, "article"]))
Enphase Energy Inc., a supplier of microinverter-based solar and battery systems, says its partner
Lumio will be significantly expanding its offering of Enphase IQ8 microinverters and IQ batteries
to customers across the United States. The strategic relationship with Lumio will amplify the
impact and distribution of Enphase systems, providing homeowners more access to reliable,
sustainable and grid-independent power sources, the company says. “ We are excited about Enphase’ s
full suite of products – including microinverters, batteries and EV chargers – that can provide our
customers best-in-class home energy management solutions, ” says Greg Butterfield, CEO at Lumio. “
Additionally, the Enphase digital platform, from lead generation to permitting to ongoing
operations and maintenance services, offers a unique ability for Lumio to increase efficiencies and
reduce costs. ” For homeowners who want battery backup, there are no sizing restrictions on pairing
Enphase IQ batteries with IQ8 microinverters, and the Sunlight Jump Start feature can restart a
home energy system – switching to sunlight-only after prolonged grid outages that may result in a
fully depleted battery. This eliminates the need for a manual restart of the system and gives
homeowners greater assurance of energy resilience. “ This strategic relationship with Enphase makes
it easier for Lumio’ s customers to take control of their power production, power consumption, and
increase the security and reliability of their family’ s power supply, ” adds David Schonberg,
senior vice president of energy partnerships at Lumio. Solar Industry offers industry participants
probing, comprehensive assessments of the technology, tools and trends that are driving this
dynamic energy sector. From raw materials straight through to end-user applications, we capture and
analyze the critical details that help professionals stay current and navigate the solar market. ©
Copyright Zackin Publications Inc. All Rights Reserved.
articles_df["article"].duplicated().sum()
5
duplicate_articles = articles_df[articles_df["article"].duplicated(keep=False)].sort_values("article")
duplicate_articles
title content domain url article
78215 China's wind giants are chasing global growth:... ['Geopolitics as much as price or quality will... rechargenews https://www.rechargenews.com/wind/chinas-wind-... Geopolitics as much as price or quality will d...
78216 Why geopolitics will set the limits of China's... ['Geopolitics as much as price or quality will... rechargenews https://www.rechargenews.com/wind/why-geopolit... Geopolitics as much as price or quality will d...
80067 Sodium-ion battery production capacity to grow... ['Global demand for sodium-ion batteries is ex... pv-magazine https://www.pv-magazine.com/2023/07/17/sodium-... Global demand for sodium-ion batteries is expe...
80073 Sodium-ion battery fleet to grow to 10 GWh by ... ['Global demand for sodium-ion batteries is ex... pv-magazine https://www.pv-magazine.com/2023/07/17/sodium-... Global demand for sodium-ion batteries is expe...
6685 Indonesia seeks investors for giant geothermal... ['Indonesia, home to the world’ s largest geot... energyvoice https://www.energyvoice.com/oilandgas/467719/i... Indonesia, home to the world’ s largest geothe...
6689 Indonesia seeks investors for giant geothermal... ['Indonesia, home to the world’ s largest geot... energyvoice https://sgvoice.energyvoice.com/investing/2002... Indonesia, home to the world’ s largest geothe...
78225 Quest for endless green energy from Earth's co... ['One of Japan’ s largest utility groups Chubu... rechargenews https://www.rechargenews.com/energy-transition... One of Japan’ s largest utility groups Chubu E...
78227 Limitless green energy from Earth's core quest... ['One of Japan’ s largest utility groups Chubu... rechargenews https://www.rechargenews.com/news/2-1-1487279 One of Japan’ s largest utility groups Chubu E...
78210 Portugal energy transition plan targets massiv... ['Portugal has more than doubled its 2030 goal... rechargenews https://www.rechargenews.com/energy-transition... Portugal has more than doubled its 2030 goals ...
78212 Wind, hydrogen and solar fused in Portugal's p... ['Portugal has more than doubled its 2030 goal... rechargenews https://www.rechargenews.com/energy-transition... Portugal has more than doubled its 2030 goals ...

Our analysis uncovers further insight into content duplication. We observe cases where seemingly identical articles are reposted on the same domain under different titles (excluding the "sgvoice.energyvoice.com" vs. "energyvoice.com" scenario addressed previously). We'll deliberately keep these duplicates: their contents match, but their titles differ.

Importance of Titles

We keep these duplicate articles because titles can hold significant relevance for our RAG pipeline. Consider a scenario where a user's query uses an abbreviation that appears only in an article's title, while the content always spells out the full term. To bridge this gap, we'll prepend titles to the article content during preprocessing. This ensures that the retrieval process considers not only the content itself but also the potentially informative titles.
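A minimal sketch of the title-prepending step described above, applied to a toy frame with the same column names (the separator choice is an assumption, not taken from the notebook):

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["EV charging grows"],
    "article": ["Electric vehicle charging infrastructure expanded in 2023."],
})

# Prepend the title so retrieval can also match terms that appear only there,
# e.g. an abbreviation used in the title but spelled out in the body
df["article"] = df["title"] + ". " + df["article"]
```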

Next Step

As previously noted, some articles exhibit standardized introductions, possibly artifacts of the data scraping process. We'll develop appropriate techniques to handle these introductions during preprocessing, ensuring they don't hinder the effectiveness of our RAG pipeline.

articles_df.article.map(lambda x: x[:50]).value_counts()
article
By clicking `` Allow All '' you agree to the stori    1627
Sign in to get the best natural gas news and data.     658
window.dojoRequire ( [ `` mojo/signup-forms/Loader      52
None of these red flags by themselves make a compa      19
Volkswagen ID.4 sales were up 254% in the 1st quar      14
                                                      ... 
You want to invest in renewable energy or a better       1
The best way to deal with carbon is not to release       1
When there is deflation, the prices of goods in th       1
Stickers are excellent products to leverage in bot       1
Arevon Energy Inc. has closed financing on the Vik       1
Name: count, Length: 6765, dtype: int64
artifacts = [
    "By clicking `` Allow All '' you agree to the sto",
    "Sign in to get the best natural gas news and dat",
    "window.dojoRequire ( [ `` mojo/signup-forms/Load"
]

for artifact in artifacts:
    print(wrap_text(articles_df[articles_df.article.str.startswith(artifact)].article.iloc[0][:500]))
    print()
By clicking `` Allow All '' you agree to the storing of cookies on your device to enhance site
navigation, analyse site usage and support us in providing free open access scientific content.
More info. Nel Hydrogen is committed to pushing the boundaries of science and continues to support
the research and development of new and innovative technologies. A group of leading researchers and
two employees of Proton Energy Systems, Inc., a subsidiary of Nel ASA ( Nel Hydrogen) have recently
published 

Sign in to get the best natural gas news and data. Follow the topics you want and receive the daily
emails. Your email address * Your password * Remember me Continue Reset password Featured Content
News & Data Services Client Support Bidweek Markets | Natural Gas Prices | NGI All News Access
Major fluctuations in the latest weather models resulted in big swings in natural gas bidweek
prices, with solid gains on the East Coast and out West. However, much of the country’ s midsection
posted hefty 

window.dojoRequire ( [ `` mojo/signup-forms/Loader '' ], function ( L) { L.start ( { `` baseUrl '':
'' mc.us4.list-manage.com '', '' uuid '': '' 2a6df7ce0f3230ba1f5efe12c '', '' lid '': '' 1e23cc3ebd
'', '' uniqueMethods '': true }) }) American consumers are more concerned about the planet than
steady economic growth, new report. Your company wants to be a part of this. What steps do you
take? Each company should create detailed reports that evaluate the environmental impact of the
business, num

def remove_scraping_artifacts(df: pd.DataFrame, column: str) -> pd.DataFrame:
    text_artifacts = [
        "By clicking `` Allow All '' you agree to the storing of cookies on your device to enhance site navigation, analyse site usage and support us in providing free open access scientific content. More info.",
        "Sign in to get the best natural gas news and data. Follow the topics you want and receive the daily emails. Your email address * Your password * Remember me Continue Reset password Featured Content News & Data Services Client Support"
    ]

    regex_artifacts = [
        r"window\.dojoRequire \( \[ .*\}\) \}\) "
    ]

    # Operate on the frame that was passed in, not on the global articles_df
    for pattern in text_artifacts:
        df[column] = df[column].str.replace(pattern, '', regex=False)

    for pattern in regex_artifacts:
        df[column] = df[column].str.replace(pattern, '', regex=True)

    return df
articles_df = remove_scraping_artifacts(articles_df, "article")
articles_df.article.map(lambda x: x[:50]).value_counts()
article
 Daily GPI Energy Transition | Infrastructure | NG    38
 Daily GPI E & P | NGI All News Access The U.S. na    36
 Daily GPI Energy Transition | NGI All News Access    28
None of these red flags by themselves make a compa    19
 Daily GPI Markets | Natural Gas Prices | NGI All     17
                                                      ..
 Award winning cleantech firm Aceleron’ s repairab     1
 Generating safe, green energy is one thing but pr     1
 Countries around the world need to move further a     1
 The sun is arguably the most important renewable      1
Arevon Energy Inc. has closed financing on the Vik     1
Name: count, Length: 8749, dtype: int64

Our efforts have successfully eliminated a substantial portion of the scraping artifacts within the articles. However, some traces still persist, likely remnants of past website navigation structures. While removing these remaining artifacts could offer further refinement, it also presents a significant challenge. Therefore, we'll acknowledge this for now and move on to further preprocessing, such as filtering out articles that are not in English.

articles_df["lang"] = articles_df["article"].map(detect)
articles_df["lang"].value_counts()
lang
en    9588
de       4
ru       1
Name: count, dtype: int64
articles_df[articles_df["lang"] != "en"]
title content domain url article lang
8283 International Energy Storage Conference ( IRES... ['EUROSOLAR veranstaltet vom 16. bis 18. März ... eurosolar https://www.eurosolar.de/2021/01/26/internatio... EUROSOLAR veranstaltet vom 16. bis 18. März 20... de
8304 Open Letter to Presidents Putin, Biden, Zelens... ['EUROSOLAR, the European Association for Rene... eurosolar https://www.eurosolar.de/sektionen/russland/ EUROSOLAR, the European Association for Renewa... ru
8307 Internationale Konferenz für Energiespeicher m... ['Die nun zu Ende gegangene „ Internationale E... eurosolar https://www.eurosolar.de/2022/09/26/internatio... Die nun zu Ende gegangene „ Internationale Ern... de
8308 Presentations, Poster and Photos of the IRES 2022 ['Photos from the IRES ( Copyright EUROSOLAR e... eurosolar https://www.eurosolar.de/2022/10/20/presentati... Photos from the IRES ( Copyright EUROSOLAR e.V... de
24652 SMS group liefert Prozesstechnologie für das e... ['© SMS group liefert Prozesstechnologie für d... decarbxpo https://www.decarbxpo.com/en/News_Media/Magazi... © SMS group liefert Prozesstechnologie für das... de
print(wrap_text(articles_df[articles_df["lang"] != "en"].iloc[1]["article"][1000:]))
 suffering and misery for over a century, while distracting from the one common enemy threatening
to consume all: accelerated fossil fueled climate heating. The Ukraine’ s EUROSOLAR section and its
networks have long advocated a new age with renewable energy in Eastern Europe. Together with all
of our other sections and members across the European continent, from Russia to the Netherlands,
and from Turkey to Denmark, EUROSOLAR offers this Climate Peace Platform. Prof. Peter Droege,
President of EUROSOLAR: β€œ The time has come for Climate Peace Diplomacy, to confront everyone’ s
common enemy: advanced fossil climate destabilization. This is one of ten actions presented by
EUROSOLAR as the main agenda of our time. ” Dr. Brigitte Schmidt, Vice President and Board Member
of EUROSOLAR Germany: β€˜ The time for renewable peace has come, part of our Regenerative Earth
Decade program. It stands for rethinking and peaceful action for our common future.’ Since its very
foundation in 1988 EUROSOLAR has worked to end fossil fuel wars through the great switch to 100%
renewable energy. In the words of Hermann Scheer ( 1944-2010), founder of EUROSOLAR: β€˜ Renewable
energies build peace’. The age of fossil-nuclear threats must end, the existential focus must
begin: www.earthdecade.org. EUROSOLAR also calls for a shift in thinking towards climate peace
diplomacy that recognizes and combats fossil dependencies as humanity’ s greatest common enemy.
https:
//www.eurosolar.org/en/2022/02/01/regenerative-earth-decade-eurosolars-call-for-climate-peace-diplom
cy/ Π’Ρ–Π΄ΠΊΡ€ΠΈΡ‚ΠΈΠΉ лист ΠΏΡ€Π΅Π·ΠΈΠ΄Π΅Π½Ρ‚Π°ΠΌ ΠŸΡƒΡ‚Ρ–Π½Ρƒ, Π‘Π°ΠΉΠ΄Π΅Π½, Π—Π΅Π»Π΅Π½ΡΡŒΠΊΠΈΠΉ Ρ– Π›ΡƒΠΊΠ°ΡˆΠ΅Π½ΠΊΠΎ: Eurosolar, Π„Π²Ρ€ΠΎΠΏΠ΅ΠΉΡΡŒΠΊΠ°
асоціація Π²Ρ–Π΄Π½ΠΎΠ²Π»ΡŽΠ²Π°Π½ΠΎΡ— Π΅Π½Π΅Ρ€Π³Π΅Ρ‚ΠΈΠΊΠΈ, Π·Π°ΠΊΠ»ΠΈΠΊΠ°Ρ” Π΄ΠΎ Π½Π΅Π³Π°ΠΉΠ½ΠΎΠ³ΠΎ припинСння вогню Ρ‚Π° постійної ΠΌΠΈΡ€Π½ΠΎΡ—
ΡƒΠ³ΠΎΠ΄ΠΈ ΠΏΠΎ всій Π‘Ρ…Ρ–Π΄Π½Ρ–ΠΉ Π„Π²Ρ€ΠΎΠΏΡ–, Π±Π΅Ρ€ΡƒΡ‡ΠΈ ΡƒΡ‡Π°ΡΡ‚ΡŒ Ρƒ всСсторонній ΠΊΠ»Ρ–ΠΌΠ°Ρ‚ΠΈΡ‡Π½Ρ–ΠΉ ΠΌΠΈΡ€Π½Ρ–ΠΉ Π΄ΠΈΠΏΠ»ΠΎΠΌΠ°Ρ‚Ρ–Ρ—. Напад
Ρ€ΠΎΡΡ–ΠΉΡΡŒΠΊΠΈΡ… Π²Ρ–ΠΉΡΡŒΠΊΠΎΠ²ΠΈΡ… Π½Π° ΡƒΠΊΡ€Π°Ρ—Π½ΡΡŒΠΊΠΈΠΉ Π½Π°Ρ€ΠΎΠ΄ Ρ– ΠΉΠΎΠ³ΠΎ уряд ΠΏΠΎΠ²ΠΈΠ½Π΅Π½ Π±ΡƒΡ‚ΠΈ засудТСний Π½Π°ΠΉΡ€Ρ–ΡˆΡƒΡ‡Ρ–ΡˆΠΈΠΌ Ρ‡ΠΈΠ½ΠΎΠΌ Ρ–
ΠΏΠΎΠ²ΠΈΠ½Π΅Π½ Π½Π΅Π³Π°ΠΉΠ½ΠΎ припинитися. Всі ΠΊΡ€Π°Ρ—Π½ΠΈ, які Π²ΠΈΠΊΠΎΡ€ΠΈΡΡ‚ΠΎΠ²ΡƒΡŽΡ‚ΡŒ Π²Ρ–ΠΉΡΡŒΠΊΠΎΠ²Ρ– альянси для постійного
коригування сфСр інтСрСсів Ρ– постійно ТокСя для Ρ‚Π°ΠΊΡ‚ΠΈΡ‡Π½ΠΈΡ… Ρ– стратСгічних ΠΏΠ΅Ρ€Π΅Π²Π°Π³, ΠΏΠΎΠ²ΠΈΠ½Π½Ρ– ΠΏΡ€ΠΈΠΏΠΈΠ½ΠΈΡ‚ΠΈ
свою Π΄Π΅ΡΡ‚Π°Π±Ρ–Π»Ρ–Π·ΡƒΡŽΡ‡Ρƒ ΠΏΡ€Π°ΠΊΡ‚ΠΈΠΊΡƒ. Всі сторони ΠΏΠΎΠ²ΠΈΠ½Π½Ρ– прокинутися: ΠΌΠΈ Π½Π΅ Ρ‚Ρ–Π»ΡŒΠΊΠΈ всі дивлямося Π² ядСрну
ΠΏΡ€Ρ–Ρ€Π²Ρƒ Ρ‡Π΅Ρ€Π΅Π· Ρ‚Ρ€ΠΈΠ²Π°Π»Ρ– Π½Π΅Π²Π΄Π°Π»Ρ– спроби роззброєння – ΠΏΠ»Π°Π½Π΅Ρ‚Π° Ρ‚Π°ΠΊΠΎΠΆ Π·Π½Π°Ρ…ΠΎΠ΄ΠΈΡ‚ΡŒΡΡ Π² Π»Π΅Ρ‰Π°Ρ‚Π°Ρ…
Π½Π΅ΠΊΠΎΠ½Ρ‚Ρ€ΠΎΠ»ΡŒΠΎΠ²Π°Π½ΠΎΡ— ΠΊΠ»Ρ–ΠΌΠ°Ρ‚ΠΈΡ‡Π½ΠΎΡ— спіралі, яка ΠΏΡ€Π°ΠΊΡ‚ΠΈΡ‡Π½ΠΎ Π½Π°ΠΏΠ΅Π²Π½ΠΎ Π·Ρ€ΠΎΠ±ΠΈΡ‚ΡŒ Ρ—Ρ— Π½Π΅ΠΏΡ€ΠΈΠ΄Π°Ρ‚Π½ΠΎΡŽ для Тиття Π²
Ρ†ΡŒΠΎΠΌΡƒ ΠΏΠΎΠΊΠΎΠ»Ρ–Π½Π½Ρ–. Eurosolar, Π„Π²Ρ€ΠΎΠΏΠ΅ΠΉΡΡŒΠΊΠ° асоціація Π²Ρ–Π΄Π½ΠΎΠ²Π»ΡŽΠ²Π°Π½ΠΎΡ— Π΅Π½Π΅Ρ€Π³Π΅Ρ‚ΠΈΠΊΠΈ, Π·Π°ΠΊΠ»ΠΈΠΊΠ°Ρ” Π΄ΠΎ ΠΏΠΎΠ²Π½ΠΎΠ³ΠΎ Ρ–
швидкого ΠΏΠ΅Ρ€Π΅Ρ…ΠΎΠ΄Ρƒ Π΄ΠΎ Π²Ρ–Π΄Π½ΠΎΠ²Π»ΡŽΠ²Π°Π½ΠΎΡ— Π΅Π½Π΅Ρ€Π³Π΅Ρ‚ΠΈΠΊΠΈ, Ρ‰ΠΎΠ± покласти ΠΊΡ€Π°ΠΉ залСТності Π„Π²Ρ€ΠΎΠΏΠΈ Ρ‚Π° світу Π²Ρ–Π΄
Π²ΠΈΠΊΠΎΠΏΠ½ΠΎΠ³ΠΎ ΠΏΠ°Π»ΠΈΠ²Π°. Π¦Π΅ ΠΏΡ€ΠΈΠ·Π²Π΅Π»ΠΎ Π΄ΠΎ нСскінчСнної Π²Ρ–ΠΉΠ½ΠΈ, Π½Π΅Π²ΠΈΠΌΠΎΠ²Π½ΠΈΡ… ΡΡ‚Ρ€Π°ΠΆΠ΄Π°Π½ΡŒ Ρ– ΡΡ‚Ρ€Π°ΠΆΠ΄Π°Π½ΡŒ протягом
Π±Ρ–Π»ΡŒΡˆ Π½Ρ–ΠΆ століття, Π²Ρ–Π΄Π²ΠΎΠ»Ρ–ΠΊΠ°ΡŽΡ‡ΠΈ Π²Ρ–Π΄ ΠΎΠ΄Π½ΠΎΠ³ΠΎ ΡΠΏΡ–Π»ΡŒΠ½ΠΎΠ³ΠΎ Π²ΠΎΡ€ΠΎΠ³Π°, який ΠΏΠΎΠ³Ρ€ΠΎΠΆΡƒΡ” споТивати всС:
прискорСнС нагрівання ΠΊΠ»Ρ–ΠΌΠ°Ρ‚Ρƒ Π½Π° Π²ΠΈΠΊΠΎΠΏΠ½ΠΎΠΌΡƒ ΠΏΠ°Π»ΠΈΠ²Ρ–. Π£ΠΊΡ€Π°Ρ—Π½ΡΡŒΠΊΠ° сСкція EUROSOLAR Ρ‚Π° Ρ—Ρ— ΠΌΠ΅Ρ€Π΅ΠΆΡ– Π²ΠΆΠ΅
Π΄Π°Π²Π½ΠΎ Π²ΠΈΡΡ‚ΡƒΠΏΠ°ΡŽΡ‚ΡŒ Π·Π° Π½ΠΎΠ²Ρƒ Π΅ΠΏΠΎΡ…Ρƒ Π²Ρ–Π΄Π½ΠΎΠ²Π»ΡŽΠ²Π°Π½ΠΎΡ— Π΅Π½Π΅Ρ€Π³Π΅Ρ‚ΠΈΠΊΠΈ Ρƒ Π‘Ρ…Ρ–Π΄Π½Ρ–ΠΉ Π„Π²Ρ€ΠΎΠΏΡ–. Π Π°Π·ΠΎΠΌ Π· усіма Ρ–Π½ΡˆΠΈΠΌΠΈ
нашими сСкціями Ρ‚Π° Ρ‡Π»Π΅Π½Π°ΠΌΠΈ Π½Π° Ρ”Π²Ρ€ΠΎΠΏΠ΅ΠΉΡΡŒΠΊΠΎΠΌΡƒ ΠΊΠΎΠ½Ρ‚ΠΈΠ½Π΅Π½Ρ‚Ρ–, Π²Ρ–Π΄ Росії Π΄ΠΎ НідСрландів, Π° Ρ‚Π°ΠΊΠΎΠΆ Π²Ρ–Π΄
Π’ΡƒΡ€Π΅Ρ‡Ρ‡ΠΈΠ½ΠΈ Π΄ΠΎ Π”Π°Π½Ρ–Ρ—, EUROSOLAR ΠΏΡ€ΠΎΠΏΠΎΠ½ΡƒΡ” Ρ†ΡŽ ΠΊΠ»Ρ–ΠΌΠ°Ρ‚ΠΈΡ‡Π½Ρƒ ΠΌΠΈΡ€Π½Ρƒ ΠΏΠ»Π°Ρ‚Ρ„ΠΎΡ€ΠΌΡƒ. ΠŸΡ€ΠΎΡ„. ΠŸΡ–Ρ‚Π΅Ρ€ Π”Ρ€ΠΎΡƒΠ΄ΠΆ, ΠŸΡ€Π΅Π·ΠΈΠ΄Π΅Π½Ρ‚
EUROSOLAR: β€ž Настав час для ΠΊΠ»Ρ–ΠΌΠ°Ρ‚ΠΈΡ‡Π½ΠΎΡ— ΠΌΠΈΡ€Π½ΠΎΡ— Π΄ΠΈΠΏΠ»ΠΎΠΌΠ°Ρ‚Ρ–Ρ—, Ρ‰ΠΎΠ± протистояти ΡΠΏΡ–Π»ΡŒΠ½ΠΎΠΌΡƒ Π²ΠΎΡ€ΠΎΠ³Ρƒ
ΠΊΠΎΠΆΠ½ΠΎΠ³ΠΎ: ΠΏΠ΅Ρ€Π΅Π΄ΠΎΠ²Ρ–ΠΉ дСстабілізації Π²ΠΈΠΊΠΎΠΏΠ½ΠΎΠ³ΠΎ ΠΊΠ»Ρ–ΠΌΠ°Ρ‚Ρƒ. Π¦Π΅ ΠΎΠ΄Π½Π° Π· дСсяти Π΄Ρ–ΠΉ, прСдставлСних EUROSOLAR
як основний порядок Π΄Π΅Π½Π½ΠΈΠΉ нашого часу. β€œ Π— ΠΌΠΎΠΌΠ΅Π½Ρ‚Ρƒ свого заснування Π² 1988 Ρ€ΠΎΡ†Ρ– EUROSOLAR ΠΏΡ€Π°Ρ†ΡŽΠ²Π°Π²
Π½Π°Π΄ припинСнням Π²Ρ–ΠΉΠ½ΠΈ Π½Π° Π²ΠΈΠΊΠΎΠΏΠ½ΠΎΠΌΡƒ ΠΏΠ°Π»ΠΈΠ²Ρ– ΡˆΠ»ΡΡ…ΠΎΠΌ Π²Π΅Π»ΠΈΠΊΠΎΠ³ΠΎ ΠΏΠ΅Ρ€Π΅Ρ…ΠΎΠ΄Ρƒ Π½Π° 100% Π²Ρ–Π΄Π½ΠΎΠ²Π»ΡŽΠ²Π°Π½Ρƒ Π΅Π½Π΅Ρ€Π³Ρ–ΡŽ. Π—Π°
словами Π“Π΅Ρ€ΠΌΠ°Π½Π° Π¨ΠΈΡ€Π° ( 1944-2010), засновника EUROSOLAR: Β« Π’Ρ–Π΄Π½ΠΎΠ²Π»ΡŽΠ²Π°Π½Ρ– Π΄ΠΆΠ΅Ρ€Π΅Π»Π° Π΅Π½Π΅Ρ€Π³Ρ–Ρ— ΡΡ‚Π²ΠΎΡ€ΡŽΡŽΡ‚ΡŒ
ΠΌΠΈΡ€ Β». Π•ΠΏΠΎΡ…Π° Π²ΠΈΠΊΠΎΠΏΠ½ΠΎ-ядСрних Π·Π°Π³Ρ€ΠΎΠ· ΠΏΠΎΠ²ΠΈΠ½Π½Π° закінчитися, ΠΏΠΎΠ²ΠΈΠ½Π΅Π½ початися Π΅ΠΊΠ·ΠΈΡΡ‚Π΅Π½Ρ†Ρ–Π°Π»ΡŒΠ½ΠΈΠΉ фокус:
www.earthdecade.org. EUROSOLAR Ρ‚Π°ΠΊΠΎΠΆ Π·Π°ΠΊΠ»ΠΈΠΊΠ°Ρ” Π΄ΠΎ Π·ΠΌΡ–Π½ΠΈ мислСння Π² Π±Ρ–ΠΊ ΠΊΠ»Ρ–ΠΌΠ°Ρ‚ΠΈΡ‡Π½ΠΎΡ— ΠΌΠΈΡ€Π½ΠΎΡ—
Π΄ΠΈΠΏΠ»ΠΎΠΌΠ°Ρ‚Ρ–Ρ—, яка Π²ΠΈΠ·Π½Π°Ρ” Ρ– Π±ΠΎΡ€Π΅Ρ‚ΡŒΡΡ Π· Π²ΠΈΠΊΠΎΠΏΠ½ΠΈΠΌΠΈ залСТностями як Π½Π°ΠΉΠ±Ρ–Π»ΡŒΡˆΠΈΠΉ ΡΠΏΡ–Π»ΡŒΠ½ΠΈΠΉ Π²ΠΎΡ€ΠΎΠ³ Π»ΡŽΠ΄ΡΡ‚Π²Π°.
https:
//www.eurosolar.org/en/2022/02/01/regenerative-earth-decade-eurosolars-call-for-climate-peace-diplom
cy/ ΠžΡ‚ΠΊΡ€Ρ‹Ρ‚ΠΎΠ΅ письмо ΠΏΡ€Π΅Π·ΠΈΠ΄Π΅Π½Ρ‚Π°ΠΌ ΠŸΡƒΡ‚ΠΈΠ½Ρƒ, Π‘Π°ΠΉΠ΄Π΅Π½Ρƒ, ЗСлСнскому ΠΈ Π›ΡƒΠΊΠ°ΡˆΠ΅Π½ΠΊΠΎ: EUROSOLAR, ЕвропСйская
ассоциация возобновляСмой энСргСтики, ΠΏΡ€ΠΈΠ·Ρ‹Π²Π°Π΅Ρ‚ ΠΊ Π½Π΅ΠΌΠ΅Π΄Π»Π΅Π½Π½ΠΎΠΌΡƒ ΠΏΡ€Π΅ΠΊΡ€Π°Ρ‰Π΅Π½ΠΈΡŽ климатичСского огня ΠΈ
Π·Π°ΠΊΠ»ΡŽΡ‡Π΅Π½ΠΈΡŽ постоянного климатичСского ΠΌΠΈΡ€Π½ΠΎΠ³ΠΎ соглашСния ΠΏΠΎ всСй Восточной Π•Π²Ρ€ΠΎΠΏΠ΅ – ΠΈ, Ρ‚Π°ΠΊΠΈΠΌ
ΠΎΠ±Ρ€Π°Π·ΠΎΠΌ, ΠΊ Π½Π°Ρ‡Π°Π»Ρƒ многостороннСй климатичСской ΠΌΠΈΡ€Π½ΠΎΠΉ Π΄ΠΈΠΏΠ»ΠΎΠΌΠ°Ρ‚ΠΈΠΈ. НападСниС российских Π²ΠΎΠ΅Π½Π½Ρ‹Ρ… Π½Π°
украинский Π½Π°Ρ€ΠΎΠ΄ ΠΈ Π΅Π³ΠΎ ΠΏΡ€Π°Π²ΠΈΡ‚Π΅Π»ΡŒΡΡ‚Π²ΠΎ Π΄ΠΎΠ»ΠΆΠ½ΠΎ Π±Ρ‹Ρ‚ΡŒ осуТдСно самым Ρ€Π΅ΡˆΠΈΡ‚Π΅Π»ΡŒΠ½Ρ‹ΠΌ ΠΎΠ±Ρ€Π°Π·ΠΎΠΌ ΠΈ Π½Π΅ΠΌΠ΅Π΄Π»Π΅Π½Π½ΠΎ
остановлСно. ВсС страны, ΠΊΠΎΡ‚ΠΎΡ€Ρ‹Π΅ ΠΈΡΠΏΠΎΠ»ΡŒΠ·ΡƒΡŽΡ‚ Π²ΠΎΠ΅Π½Π½Ρ‹Π΅ ΡΠΎΡŽΠ·Ρ‹ для постоянной ΠΊΠΎΡ€Ρ€Π΅ΠΊΡ‚ΠΈΡ€ΠΎΠ²ΠΊΠΈ своих сфСр
интСрСсов ΠΈ постоянной Π±ΠΎΡ€ΡŒΠ±Ρ‹ Π·Π° тактичСскоС ΠΈ стратСгичСскоС прСимущСство, Π΄ΠΎΠ»ΠΆΠ½Ρ‹ ΠΏΡ€Π΅ΠΊΡ€Π°Ρ‚ΠΈΡ‚ΡŒ свою
Π΄Π΅ΡΡ‚Π°Π±ΠΈΠ»ΠΈΠ·ΠΈΡ€ΡƒΡŽΡ‰ΡƒΡŽ ΠΏΡ€Π°ΠΊΡ‚ΠΈΠΊΡƒ. ВсС Π²ΠΎΠ²Π»Π΅Ρ‡Π΅Π½Π½Ρ‹Π΅ стороны Π΄ΠΎΠ»ΠΆΠ½Ρ‹ ΠΏΡ€ΠΎΡΠ½ΡƒΡ‚ΡŒΡΡ: Мало Ρ‚ΠΎΠ³ΠΎ, Ρ‡Ρ‚ΠΎ ΠΌΡ‹ всС
смотрим Π² ΡΠ΄Π΅Ρ€Π½ΡƒΡŽ Π±Π΅Π·Π΄Π½Ρƒ ΠΈΠ·-Π·Π° Π΄Π»ΠΈΡ‚Π΅Π»ΡŒΠ½Ρ‹Ρ… Π½Π΅ΡƒΠ΄Π°Ρ‡Π½Ρ‹Ρ… ΠΏΠΎΠΏΡ‹Ρ‚ΠΎΠΊ разоруТСния – ΠΏΠ»Π°Π½Π΅Ρ‚Π° Ρ‚Π°ΠΊΠΆΠ΅ находится Π²
Π½Π΅ΠΊΠΎΠ½Ρ‚Ρ€ΠΎΠ»ΠΈΡ€ΡƒΠ΅ΠΌΠΎΠΉ климатичСской спирали, которая ΠΏΠΎΡ‡Ρ‚ΠΈ навСрняка сдСлаСт Π΅Π΅ Π½Π΅ΠΏΡ€ΠΈΠ³ΠΎΠ΄Π½ΠΎΠΉ для ΠΆΠΈΠ·Π½ΠΈ
ΡƒΠΆΠ΅ Π² этом ΠΏΠΎΠΊΠΎΠ»Π΅Π½ΠΈΠΈ. EUROSOLAR, ЕвропСйская ассоциация возобновляСмых источников энСргии,
ΠΏΡ€ΠΈΠ·Ρ‹Π²Π°Π΅Ρ‚ ΠΊ ΠΏΠΎΠ»Π½ΠΎΠΌΡƒ ΠΈ быстрому ΠΏΠ΅Ρ€Π΅Ρ…ΠΎΠ΄Ρƒ Π½Π° возобновляСмыС источники энСргии, Ρ‡Ρ‚ΠΎΠ±Ρ‹ ΠΏΠΎΠ»ΠΎΠΆΠΈΡ‚ΡŒ ΠΊΠΎΠ½Π΅Ρ†
зависимости Π•Π²Ρ€ΠΎΠΏΡ‹ ΠΈ всСго ΠΌΠΈΡ€Π° ΠΎΡ‚ ископаСмого Ρ‚ΠΎΠΏΠ»ΠΈΠ²Π°. Она ΠΏΡ€ΠΈΠ²Π΅Π»Π° ΠΊ бСсконСчным Π²ΠΎΠΉΠ½Π°ΠΌ,
Π½Π΅Π²Ρ‹Ρ€Π°Π·ΠΈΠΌΡ‹ΠΌ страданиям ΠΈ Π½Π΅ΡΡ‡Π°ΡΡ‚ΡŒΡΠΌ Π½Π° протяТСнии Π±ΠΎΠ»Π΅Π΅ Π²Π΅ΠΊΠ°, отвлСкая нас ΠΎΡ‚ ΠΎΠ΄Π½ΠΎΠ³ΠΎ ΠΎΠ±Ρ‰Π΅Π³ΠΎ Π²Ρ€Π°Π³Π°,
ΠΊΠΎΡ‚ΠΎΡ€Ρ‹ΠΉ ΡƒΠ³Ρ€ΠΎΠΆΠ°Π΅Ρ‚ ΠΏΠΎΠ³Π»ΠΎΡ‚ΠΈΡ‚ΡŒ всСх нас: ускорСнного глобального потСплСния, Π²Ρ‹Π·Π²Π°Π½Π½ΠΎΠ³ΠΎ ископаСмым
Ρ‚ΠΎΠΏΠ»ΠΈΠ²ΠΎΠΌ. Украинская сСкция EUROSOLAR ΠΈ Π΅Π΅ сСти Π΄Π°Π²Π½ΠΎ Π²Ρ‹ΡΡ‚ΡƒΠΏΠ°ΡŽΡ‚ Π·Π° Π½ΠΎΠ²ΡƒΡŽ эру с возобновляСмыми
источниками энСргии Π² Восточной Π•Π²Ρ€ΠΎΠΏΠ΅. ВмСстС со всСми Π΄Ρ€ΡƒΠ³ΠΈΠΌΠΈ нашими сСкциями ΠΈ Ρ‡Π»Π΅Π½Π°ΠΌΠΈ ΠΏΠΎ всСму
СвропСйскому ΠΊΠΎΠ½Ρ‚ΠΈΠ½Π΅Π½Ρ‚Ρƒ, ΠΎΡ‚ России Π΄ΠΎ НидСрландов ΠΈ ΠΎΡ‚ Π’ΡƒΡ€Ρ†ΠΈΠΈ Π΄ΠΎ Π”Π°Π½ΠΈΠΈ, EUROSOLAR ΠΏΡ€Π΅Π΄Π»Π°Π³Π°Π΅Ρ‚ эту
ΠΏΠ»Π°Ρ‚Ρ„ΠΎΡ€ΠΌΡƒ ΠΌΠΈΡ€Π° ΠΊΠ»ΠΈΠΌΠ°Ρ‚Ρƒ. ΠŸΡ€ΠΎΡ„Π΅ΡΡΠΎΡ€ ΠŸΠ΅Ρ‚Π΅Ρ€ Π”Ρ€ΠΎΠ³Π΅, ΠΏΡ€Π΅Π·ΠΈΠ΄Π΅Π½Ρ‚ EUROSOLAR: β€ž Настало врСмя для
климатичСской ΠΌΠΈΡ€Π½ΠΎΠΉ Π΄ΠΈΠΏΠ»ΠΎΠΌΠ°Ρ‚ΠΈΠΈ, Ρ‡Ρ‚ΠΎΠ±Ρ‹ ΠΏΡ€ΠΎΡ‚ΠΈΠ²ΠΎΡΡ‚ΠΎΡΡ‚ΡŒ ΠΎΠ±Ρ‰Π΅ΠΌΡƒ для всСх Π²Ρ€Π°Π³Ρƒ: дСстабилизации ΠΊΠ»ΠΈΠΌΠ°Ρ‚Π°
Π·Π° счСт ΠΏΠ΅Ρ€Π΅Π΄ΠΎΠ²ΠΎΠ³ΠΎ ископаСмого Ρ‚ΠΎΠΏΠ»ΠΈΠ²Π°. Π­Ρ‚ΠΎ ΠΎΠ΄Π½ΠΎ ΠΈΠ· дСсяти дСйствий, ΠΊΠΎΡ‚ΠΎΡ€Ρ‹Π΅ EUROSOLAR прСдставляСт
ΠΊΠ°ΠΊ ΡΠ°ΠΌΡƒΡŽ Π²Π°ΠΆΠ½ΡƒΡŽ повСстку дня нашСго Π²Ρ€Π΅ΠΌΠ΅Π½ΠΈ β€œ. Π”ΠΎΠΊΡ‚ΠΎΡ€ Π‘Ρ€ΠΈΠ³ΠΈΡ‚Ρ‚Π΅ Π¨ΠΌΠΈΠ΄Ρ‚, Π²ΠΈΡ†Π΅-ΠΏΡ€Π΅Π·ΠΈΠ΄Π΅Π½Ρ‚ ΠΈ Ρ‡Π»Π΅Π½
правлСния EUROSOLAR ГСрмания: β€ž Наступило врСмя возобновляСмого ΠΌΠΈΡ€Π°, Ρ‡Π°ΡΡ‚ΡŒ нашСй ΠΏΡ€ΠΎΠ³Ρ€Π°ΠΌΠΌΡ‹ β€ž
ВозобновляСмоС дСсятилСтиС β€œ. Он выступаСт Π·Π° пСрСосмыслСниС ΠΈ ΠΌΠΈΡ€Π½Ρ‹Π΅ дСйствия Π²ΠΎ имя нашСго ΠΎΠ±Ρ‰Π΅Π³ΠΎ
Π±ΡƒΠ΄ΡƒΡ‰Π΅Π³ΠΎ. Π‘ ΠΌΠΎΠΌΠ΅Π½Ρ‚Π° своСго основания Π² 1988 Π³ΠΎΠ΄Ρƒ компания EUROSOLAR Ρ€Π°Π±ΠΎΡ‚Π°Π΅Ρ‚ Π½Π°Π΄ Ρ‚Π΅ΠΌ, Ρ‡Ρ‚ΠΎΠ±Ρ‹
ΠΏΠΎΠ»ΠΎΠΆΠΈΡ‚ΡŒ ΠΊΠΎΠ½Π΅Ρ† Π²ΠΎΠΉΠ½Π°ΠΌ Π·Π° ископаСмоС Ρ‚ΠΎΠΏΠ»ΠΈΠ²ΠΎ ΠΏΡƒΡ‚Π΅ΠΌ ΠΌΠ°ΡΡˆΡ‚Π°Π±Π½ΠΎΠ³ΠΎ ΠΏΠ΅Ρ€Π΅Ρ…ΠΎΠ΄Π° Π½Π° 100% возобновляСмыС
источники энСргии. По словам Π“Π΅Ρ€ΠΌΠ°Π½Π° Π¨Π΅Π΅Ρ€Π° ( 1944-2010), основатСля EUROSOLAR: β€ž ВозобновляСмыС
источники энСргии ΡΠΎΠ·Π΄Π°ΡŽΡ‚ ΠΌΠΈΡ€ β€œ. Π’Π΅ΠΊ ископаСмо-ядСрных ΡƒΠ³Ρ€ΠΎΠ· Π΄ΠΎΠ»ΠΆΠ΅Π½ Π·Π°ΠΊΠΎΠ½Ρ‡ΠΈΡ‚ΡŒΡΡ, Π΄ΠΎΠ»ΠΆΠ½Π° Π½Π°Ρ‡Π°Ρ‚ΡŒΡΡ
ΡΠΊΠ·ΠΈΡΡ‚Π΅Π½Ρ†ΠΈΠ°Π»ΡŒΠ½Π°Ρ ориСнтация: https: //www.earthdecade.org. EUROSOLAR ΠΏΡ€ΠΈΠ·Ρ‹Π²Π°Π΅Ρ‚ ΠΊ ΠΏΠ΅Ρ€Π΅ΠΎΡΠΌΡ‹ΡΠ»Π΅Π½ΠΈΡŽ Π²
сторону климатичСской ΠΌΠΈΡ€Π½ΠΎΠΉ Π΄ΠΈΠΏΠ»ΠΎΠΌΠ°Ρ‚ΠΈΠΈ, которая ΠΏΡ€ΠΈΠ·Π½Π°Π΅Ρ‚ ΠΈ борСтся с ископаСмой Π·Π°Π²ΠΈΡΠΈΠΌΠΎΡΡ‚ΡŒΡŽ ΠΊΠ°ΠΊ
Π²Π΅Π»ΠΈΡ‡Π°ΠΉΡˆΠΈΠΌ ΠΎΠ±Ρ‰ΠΈΠΌ Π²Ρ€Π°Π³ΠΎΠΌ чСловСчСства.https:
//www.eurosolar.org/en/2022/02/01/regenerative-earth-decade-eurosolars-call-for-climate-peace-diplom
cy/ Independent of political parties, institutions, companies and interest groups, EUROSOLAR has
been developing and stimulating political and economic action drafts and concepts for the
introduction of renewable energies since 1988. This ranges from market introduction strategies to
proposals for further research and development policy, from tax policy subsidies to arms conversion
with solar energy, from the contribution of solar energy for the Global South to agricultural,
transport and construction policy. EuropΓ€ische Vereinigung fΓΌr Erneuerbare Energien e. V.
articles_df = articles_df[articles_df["lang"] == "en"]

Our exploration revealed a small number of articles containing non-English content: four in German and one open letter that mixes English with Ukrainian and Russian translations. Since most LLMs and embedding models are primarily trained on English text, removing these articles ensures compatibility with our chosen models for this notebook. For simplicity, we'll only support English queries and responses within this RAG pipeline.

Challenges of Multilingual RAG PipelinesΒΆ

Introducing multilingual capabilities into a RAG pipeline adds a layer of complexity: retrieval requires embedding models that place semantically equivalent text from different languages close together, and the generation step must understand retrieved context in one language while answering in another.

Characters, Tokens and WordsΒΆ

Let us further analyze the contents of the articles. Before we do so, let's pin down what we mean by characters, tokens and words: characters are the individual symbols in a string, words are the whitespace-separated units, and tokens are the units a tokenizer emits, which may split off punctuation, contractions and possessives.
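To make the distinction concrete, here is a toy comparison of the three counts on a single sentence; the regex split is only a crude stand-in for the spaCy tokenizer used below:

```python
import re

text = "Enphase's IQ8 microinverters don't need the grid."

n_chars = len(text)                               # individual symbols
n_words = len(text.split())                       # whitespace-separated words
n_tokens = len(re.findall(r"\w+|[^\w\s]", text))  # words plus split-off punctuation

# A tokenizer yields more units than a whitespace split because
# apostrophes and the final period become tokens of their own
```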

sns.histplot(articles_df["article"].map(len), kde=True)

plt.title("Amount of characters in articles")
plt.xlabel("Amount of characters")
plt.ylabel("Number of articles")
median_char_len = articles_df["article"].map(len).median()
mean_char_len = articles_df["article"].map(len).mean()
plt.axvline(median_char_len, color='r', linestyle='--', label=f"Median character amount: {median_char_len:.2f}")
plt.axvline(mean_char_len, color='g', linestyle='--', label=f"Mean character amount: {mean_char_len:.2f}")
plt.legend()
plt.show()
(figure: histogram of characters per article, with median and mean markers)
sns.histplot(articles_df["article"].map(lambda x: len(x.split())), kde=True)

plt.title("Amount of words in articles")
plt.xlabel("Amount of words")
plt.ylabel("Number of articles")
median_word_len = articles_df["article"].map(lambda x: len(x.split())).median()
mean_word_len = articles_df["article"].map(lambda x: len(x.split())).mean()
plt.axvline(median_word_len, color='r', linestyle='--', label=f"Median word amount: {median_word_len:.2f}")
plt.axvline(mean_word_len, color='g', linestyle='--', label=f"Mean word amount: {mean_word_len:.2f}")
plt.legend()
plt.show()
(figure: histogram of words per article, with median and mean markers)
nlp = English()
tokenizer = nlp.tokenizer

sns.histplot(articles_df["article"].map(lambda x: len(tokenizer(x))), kde=True)

plt.title("Amount of tokens in articles")
plt.xlabel("Amount of tokens")
plt.ylabel("Number of articles")
median_token_len = articles_df["article"].map(lambda x: len(tokenizer(x))).median()
mean_token_len = articles_df["article"].map(lambda x: len(tokenizer(x))).mean()
plt.axvline(median_token_len, color='r', linestyle='--', label=f"Median token amount: {median_token_len:.2f}")
plt.axvline(mean_token_len, color='g', linestyle='--', label=f"Mean token amount: {mean_token_len:.2f}")
plt.legend()
plt.show()
(figure: histogram of tokens per article, with median and mean markers)
all_tokens = [token.text for article in articles_df["article"] for token in tokenizer(article)]
# remove non-alphabetic tokens such as punctuation
alpha_tokens = [token for token in all_tokens if token.isalpha()]
alpha_tokens = [token.lower() for token in alpha_tokens]
alpha_token_counts = Counter(alpha_tokens)

sns.barplot(
    x=[count for token, count in alpha_token_counts.most_common(20)],
    y=[token for token, count in alpha_token_counts.most_common(20)],
    hue=[token for token, count in alpha_token_counts.most_common(20)]
)

plt.title("Most common alphabetic tokens")
plt.xlabel("Count")
plt.ylabel("Token")
plt.show()
(figure: bar chart of the 20 most common alphabetic tokens)
# remove stopwords such as 'the', 'a', 'and'
non_stop_tokens = [token for token in alpha_tokens if not nlp.vocab[token].is_stop]
non_stop_token_counts = Counter(non_stop_tokens)

sns.barplot(
    x=[count for token, count in non_stop_token_counts.most_common(20)],
    y=[token for token, count in non_stop_token_counts.most_common(20)],
    hue=[token for token, count in non_stop_token_counts.most_common(20)]
)

plt.title("Most common non-stopword tokens")
plt.xlabel("Count")
plt.ylabel("Token")
plt.show()
(figure: bar chart of the 20 most common non-stopword tokens)

As one would expect in a dataset of cleantech news articles, most of the tokens that are not punctuation or stopwords revolve around the subjects of energy, climate, and technology. This is a good sign that the dataset is relevant to the topic at hand. The "s" token comes up frequently, which is likely the possessive form of words being split off by the tokenizer. With an average of around 700 words per article, we can expect a good amount of information in each article and an average reading time of around 3-4 minutes.
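A back-of-the-envelope check on the reading-time claim, assuming a typical silent-reading speed of about 200 words per minute:

```python
mean_words = 700        # approximate mean word count from the histogram above
words_per_minute = 200  # common estimate of adult silent-reading speed

reading_minutes = mean_words / words_per_minute  # about 3.5 minutes
```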

Flesch Reading Ease ScoreΒΆ

The Flesch Reading Ease Score evaluates how easy a text is to understand based on average sentence length and the number of syllables per word. Scores typically range from 0 (very difficult to read) to 100 (very easy to read), though the formula itself is unbounded and can produce values outside this range. This metric helps us assess the readability of our articles and gauge how accessible they are to a broad audience.
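For reference, the score is computed with the standard Flesch formula; a minimal sketch (the helper name and the example counts are illustrative, not from the notebook, which uses a library implementation of `flesch_reading_ease`):

```python
def flesch_score(total_words: int, total_sentences: int, total_syllables: int) -> float:
    # Longer sentences and more syllables per word both lower the score
    return (206.835
            - 1.015 * (total_words / total_sentences)
            - 84.6 * (total_syllables / total_words))

# A 100-word passage with 5 sentences and 130 syllables scores about 76.6
score = flesch_score(100, 5, 130)
```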

articles_df["readability"] = articles_df["article"].apply(flesch_reading_ease)

sns.histplot(articles_df["readability"], kde=True)

plt.title("Flesch Reading Ease of articles")
plt.xlabel("Flesch Reading Ease")
plt.ylabel("Number of articles")
mean_readability = articles_df["readability"].mean()
plt.axvline(mean_readability, color='g', linestyle='--', label=f"Mean readability: {mean_readability:.2f}")
plt.legend()
plt.show()
(figure: histogram of Flesch Reading Ease scores across articles, with mean marker)
domains = articles_df["domain"].unique()

# Setup the subplots based on the number of domains
plots_per_row = 3
num_rows = (len(domains) + 2) // plots_per_row 
plot_height = 6 
fig, axes = plt.subplots(num_rows, plots_per_row, figsize=(plot_height * plots_per_row, plot_height * num_rows))
axes = axes.flatten()  # Flatten the axes array for easier iteration

# Plot for each domain
for i, domain in enumerate(domains):
    domain_articles = articles_df[articles_df["domain"] == domain]
    sns.histplot(domain_articles["readability"], kde=True, ax=axes[i], bins=30)
    axes[i].set_title(f'Readability of {domain}')
    axes[i].set_xlabel('Flesch Reading Ease Score')
    axes[i].set_ylabel("Number of articles")
    mean_readability = domain_articles["readability"].mean()
    axes[i].axvline(mean_readability, color='g', linestyle='--', label=f"Mean readability: {mean_readability:.2f}")

# remove the empty plots
for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout()
plt.show()
(figure: per-domain histograms of Flesch Reading Ease scores, each with its mean marker)

To gauge the readability of our articles, we calculated the Flesch Reading Ease score. The average of around 45 corresponds to "difficult" text on the standard Flesch scale, roughly college-level reading. This is unsurprising for specialized energy-sector journalism, and it poses no problem for our RAG pipeline, since modern embedding models and LLMs handle this register well.

The average score is consistent across most of the identified domains, with only minor variations. This indicates a relatively uniform level of readability across the different publishers in the dataset.

Finally we will save the cleaned dataset to a new file in the data/silver folder.

silver_folder = data_folder / "silver"
if not silver_folder.exists():
    silver_folder.mkdir()

articles_df.to_csv(silver_folder / "articles.csv", index=False)

Evaluation QuestionsΒΆ

Next we will analyze the provided evaluation questions and ensure that they match the content of the articles.

human_eval_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 23 entries, 1 to 23
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   question_id     23 non-null     int64 
 1   question        23 non-null     object
 2   relevant_chunk  23 non-null     object
 3   article_url     23 non-null     object
dtypes: int64(1), object(3)
memory usage: 920.0+ bytes
human_eval_df.rename(columns={"relevant_chunk":"relevant_section","article_url": "url"}, inplace=True)
human_eval_df.drop(columns=["question_id"], inplace=True)
human_eval_df.head()
question relevant_section url
example_id
1 What is the innovation behind LeclanchΓ©'s new ... LeclanchΓ© said it has developed an environment... https://www.sgvoice.net/strategy/technology/23...
2 What is the EU’s Green Deal Industrial Plan? The Green Deal Industrial Plan is a bid by the... https://www.sgvoice.net/policy/25396/eu-seeks-...
3 What is the EU’s Green Deal Industrial Plan? The European counterpart to the US Inflation R... https://www.pv-magazine.com/2023/02/02/europea...
4 What are the four focus areas of the EU's Gree... The new plan is fundamentally focused on four ... https://www.sgvoice.net/policy/25396/eu-seeks-...
5 When did the cooperation between GM and Honda ... What caught our eye was a new hookup between G... https://cleantechnica.com/2023/05/08/general-m...
sns.histplot(human_eval_df["question"].map(len), kde=True)
plt.title("Question Character Length Distribution")
plt.xlabel("Character Length")
plt.ylabel("Count")
mean_char_len = human_eval_df["question"].map(len).mean()
plt.axvline(mean_char_len, color='r', linestyle='--', label=f"Mean character amount: {mean_char_len:.2f}")
plt.legend()
plt.show()
(figure: histogram of evaluation-question character lengths, with mean marker)
missing_articles = human_eval_df.copy()
missing_articles = missing_articles[~human_eval_df["url"].isin(articles_df["url"])]
missing_articles
question relevant_section url
example_id
1 What is the innovation behind LeclanchΓ©'s new ... LeclanchΓ© said it has developed an environment... https://www.sgvoice.net/strategy/technology/23...
2 What is the EU’s Green Deal Industrial Plan? The Green Deal Industrial Plan is a bid by the... https://www.sgvoice.net/policy/25396/eu-seeks-...
4 What are the four focus areas of the EU's Gree... The new plan is fundamentally focused on four ... https://www.sgvoice.net/policy/25396/eu-seeks-...

Our exploration has identified instances where articles linked to specific questions appear to be missing from the dataset. To determine the root cause, let's investigate whether these articles are genuinely absent or if inconsistencies in URL formatting are creating the illusion of missing data. Normalizing the URLs across the dataset will help us differentiate between these two scenarios.

def normalize_url(url: str) -> str:
    url = url.replace("https://", "")
    url = url.replace("http://", "")
    url = url.replace("www.", "")
    url = url.rstrip("/")
    return url

articles_df["url"] = articles_df["url"].map(normalize_url)
human_eval_df["url"] = human_eval_df["url"].map(normalize_url)

missing_articles = human_eval_df.copy()
missing_articles = missing_articles[~human_eval_df["url"].isin(articles_df["url"])]
missing_articles
question relevant_section url
example_id
1 What is the innovation behind LeclanchΓ©'s new ... LeclanchΓ© said it has developed an environment... sgvoice.net/strategy/technology/23971/leclanch...
2 What is the EU’s Green Deal Industrial Plan? The Green Deal Industrial Plan is a bid by the... sgvoice.net/policy/25396/eu-seeks-competitive-...
4 What are the four focus areas of the EU's Gree... The new plan is fundamentally focused on four ... sgvoice.net/policy/25396/eu-seeks-competitive-...

We also know from our previous analysis that the "energyvoice" domain publishes the same articles under multiple hostnames, so we will map the "sgvoice.net" URLs onto "sgvoice.energyvoice.com" as well.

missing_articles["url"] = missing_articles["url"].map(lambda x: x.replace("sgvoice.net", "sgvoice.energyvoice.com"))
missing_articles[~missing_articles["url"].isin(articles_df["url"])]
question relevant_section url
example_id
human_eval_df.loc[missing_articles.index, "url"] = missing_articles["url"]
human_eval_df[human_eval_df["url"].isin(articles_df["url"])]
question relevant_section url
example_id
1 What is the innovation behind LeclanchΓ©'s new ... LeclanchΓ© said it has developed an environment... sgvoice.energyvoice.com/strategy/technology/23...
2 What is the EU’s Green Deal Industrial Plan? The Green Deal Industrial Plan is a bid by the... sgvoice.energyvoice.com/policy/25396/eu-seeks-...
3 What is the EU’s Green Deal Industrial Plan? The European counterpart to the US Inflation R... pv-magazine.com/2023/02/02/european-commission...
4 What are the four focus areas of the EU's Gree... The new plan is fundamentally focused on four ... sgvoice.energyvoice.com/policy/25396/eu-seeks-...
5 When did the cooperation between GM and Honda ... What caught our eye was a new hookup between G... cleantechnica.com/2023/05/08/general-motors-se...
6 Did Colgate-Palmolive enter into PPA agreement... Scout Clean Energy, a Colorado-based renewable... solarindustrymag.com/scout-and-colgate-palmoli...
7 What is the status of ZeroAvia's hydrogen fuel... In December, the US startup ZeroAvia announced... cleantechnica.com/2023/01/02/the-wait-for-hydr...
8 What is the "Danger Season"? As spring turns to summer and the days warm up... cleantechnica.com/2023/05/15/what-does-a-norma...
9 Is Mississipi an anti-ESG state? Mississippi is among two dozen or so states in... cleantechnica.com/2023/05/15/mississippi-takes...
10 Can you hang solar panels on garden fences? Scaling down from the farm to the garden level... cleantechnica.com/2023/05/18/solar-panels-for-...
11 Who develops quality control systems for ocean... Scientists from the Chinese Academy of Science... azocleantech.com/news.aspx?newsID=32873
12 Why are milder winters detrimental for grapes ... Since grapes and apples are perennial species,... azocleantech.com/news.aspx?newsID=33040
13 What are the basic recycling steps for solar p... There are some simple recycling steps that can... azocleantech.com/news.aspx?newsID=33143
14 Why does melting ice contribute to global warm... Whereas white ice reflects the sun's rays, a d... azocleantech.com/news.aspx?newsID=33149
15 Does the Swedish government plan bans on new p... The Swedish government has proposed a ban on n... azocleantech.com/news.aspx?newsID=33174
16 Where do the turbines used in Icelandic geothe... Minister Nishimura mentioned that most geother... thinkgeoenergy.com/japan-and-iceland-agree-on-...
17 Who is the target user for Leapfrog Energy? O’Brien added, “Subsurface specialists need fl... thinkgeoenergy.com/seequent-expands-subsurface...
18 What is Agrivoltaics? Agrivoltaics, the integration of food producti... pv-magazine.com/2023/03/31/new-software-modeli...
19 What is Agrivoltaics? Agrivoltaics refers to the conduct of agricult... cleantechnica.com/2022/12/18/agrivoltaics-goes...
20 Why is cannabis cultivation moving indoors? Cannabis cultivation can take place outdoors, ... pv-magazine.com/2023/04/08/high-time-for-solar...
21 What are the obstacles for cannabis producers ... “There are a lot of prevailing headwinds for c... pv-magazine.com/2023/04/08/high-time-for-solar...
22 In 2021, what were the top 3 states in the US ... In 2021, Florida surpassed North Carolina to b... cleantechnica.com/2023/04/10/solar-power-in-fl...
23 Which has the higher absorption coefficient fo... We chose amorphous germanium instead of amorph... pv-magazine.com/2021/01/15/germanium-based-sol...

In the end, we are able to find all the articles linked to the evaluation questions, and have therefore successfully completed our exploratory data analysis and preprocessing.

Subsampling¶

For faster processing and to reduce the cost of running the notebook, we will subsample the dataset to 1,000 articles. This keeps the runtime reasonable while still producing meaningful results. Because the distribution of articles across publishers is skewed, we use stratified sampling to obtain a representative sample. We also need to keep in mind that the evaluation questions are linked to specific articles, so those articles must be included in the subsample.
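Stratified sampling draws from each group in proportion to its share of the whole. As a quick toy illustration (hypothetical data; pandas' built-in `groupby(...).sample` does the proportional draw directly):

```python
import pandas as pd

# toy frame with a skewed 80/20 publisher split
df = pd.DataFrame({
    "domain": ["cleantechnica"] * 80 + ["pv-magazine"] * 20,
    "article": range(100),
})

# draw 10% within each domain, so the sample keeps the 80/20 proportions
sample = df.groupby("domain", group_keys=False).sample(frac=0.1, random_state=42)
```

The notebook's `do_stratification` helper below implements the same idea by computing each group's sample size explicitly.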

eval_articles_df = articles_df[articles_df["url"].isin(human_eval_df["url"])]
eval_articles_df.head()
title content domain url article lang readability
6780 Leclanché’ s new disruptive battery boosts ene... ['Energy storage company Leclanché ( SW.LECN) ... energyvoice sgvoice.energyvoice.com/strategy/technology/23... Energy storage company Leclanché ( SW.LECN) ha... en 43.22
6805 EU seeks competitive boost with Green Deal Ind... ['The EU has presented its ‘ Green Deal Indust... energyvoice sgvoice.energyvoice.com/policy/25396/eu-seeks-... The EU has presented its ‘ Green Deal Industri... en 34.70
16367 Agrivoltaics Goes Nuclear On California Prairie ['A decommissioned nuclear power plant from th... cleantechnica cleantechnica.com/2022/12/18/agrivoltaics-goes... A decommissioned nuclear power plant from the ... en 42.00
16402 The Wait For Hydrogen Fuel Cell Electric Aircr... ['The US firm ZeroAvia is one step closer to b... cleantechnica cleantechnica.com/2023/01/02/the-wait-for-hydr... The US firm ZeroAvia is one step closer to bri... en 50.46
16725 Solar Power In Florida ['Many renewable energy endeavors in Florida a... cleantechnica cleantechnica.com/2023/04/10/solar-power-in-fl... Many renewable energy endeavors in Florida are... en 44.75
print(eval_articles_df["url"].unique().shape)
print(human_eval_df["url"].unique().shape)
(21,)
(21,)
def do_stratification(
        df: pd.DataFrame,
        column: str,
        sample_size: int,
        seed: int = 42
) -> pd.DataFrame:
    res_df = df.copy()
    indx = (
        df.groupby(column, group_keys=False)[column]
        .apply(lambda x: x.sample(n=int(sample_size / len(df) * len(x)), random_state=seed))
        .index.to_list()
    )
    return res_df.loc[indx]
sample_df = do_stratification(articles_df, "domain", 1000, 69)
# drop evaluation articles already present in the stratified sample, so URLs stay unique after concatenation
sample_df = sample_df[~sample_df["url"].isin(eval_articles_df["url"])]
sample_df = pd.concat([sample_df, eval_articles_df])
sample_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1011 entries, 38325 to 81779
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   title        1011 non-null   object 
 1   content      1011 non-null   object 
 2   domain       1011 non-null   object 
 3   url          1011 non-null   object 
 4   article      1011 non-null   object 
 5   lang         1011 non-null   object 
 6   readability  1011 non-null   float64
dtypes: float64(1), object(6)
memory usage: 63.2+ KB
original_domain_counts = articles_df["domain"].value_counts().to_frame()
original_domain_counts = original_domain_counts / original_domain_counts.sum() * 100
domain_counts_df = original_domain_counts.copy()
domain_counts_df["type"] = "Original"


sample_domain_counts = sample_df["domain"].value_counts().to_frame()
sample_domain_counts = sample_domain_counts / sample_domain_counts.sum() * 100
sample_domain_counts["type"] = "Sample"

domain_counts_df = pd.concat([domain_counts_df, sample_domain_counts])
sns.barplot(
    x=domain_counts_df.index,
    y=domain_counts_df["count"],
    hue=domain_counts_df["type"]
)
plt.title("Domain Distribution")
plt.xlabel("Domain")
plt.ylabel("Percentage")
plt.xticks(rotation=90)
plt.show()

Chunking¶

Chunking is a crucial step in the RAG pipeline. It involves breaking down the articles into smaller, more manageable pieces.

chunking

There are two main reasons for this:

- Embedding models and LLMs have limited input lengths, so long articles must be broken up before they can be embedded or passed in as context.
- Smaller, focused chunks make retrieval more precise, because each embedding then represents a single topic rather than an entire article.

Let's start by getting a better feeling for common chunk sizes, measured in number of characters.

def get_lorem_text(num_chars: int) -> str:
    expected_avg_word_len = 3 # on the lower side to be safe
    text = lorem.words(num_chars // expected_avg_word_len)
    return text[:num_chars]
print(wrap_text(get_lorem_text(256)))
nihil rerum debitis fuga optio est modi sunt ratione tempore voluptatem reprehenderit cumque qui
quasi doloribus soluta accusamus similique id obcaecati sit incidunt molestiae eveniet quod
repudiandae laudantium libero voluptas autem harum natus quas volup
print(wrap_text(get_lorem_text(512)))
facere earum laborum amet distinctio nam ipsum quibusdam minus fuga molestiae quis perferendis sed
suscipit animi sequi aliquam nisi cumque nulla deserunt aut in quos sapiente corrupti dolorum enim
modi repellendus at assumenda voluptatibus pariatur quaerat temporibus magnam recusandae numquam
qui error nesciunt quae praesentium quia accusantium dicta nihil soluta voluptas quod excepturi est
deleniti dignissimos expedita exercitationem ut ipsa magni voluptates ratione iure ducimus
voluptatum eum dolores ven
print(wrap_text(get_lorem_text(1024)))
non rem officia beatae dolores consequuntur labore numquam sapiente ipsa nesciunt veniam quas nihil
fugiat hic nisi animi dolorem tempore eum tempora dolore accusantium amet incidunt consectetur
exercitationem id saepe accusamus eaque eligendi atque eveniet voluptates deserunt earum aut
delectus magni quae corporis dolorum laborum dicta totam vel dolor cumque fuga vero voluptatum
quibusdam nam quod temporibus neque aliquam architecto quidem eius suscipit soluta ex ab at cum
adipisci sunt nostrum placeat harum omnis sint nobis ducimus ut facere quia laudantium culpa
obcaecati sequi quo perspiciatis iusto odio minus libero mollitia nulla repellat aperiam eos enim
officiis in asperiores provident porro est et voluptatibus itaque aspernatur a repellendus
praesentium voluptas assumenda quasi qui voluptate autem quisquam velit odit reprehenderit ratione
sit alias natus tenetur repudiandae modi reiciendis nemo debitis laboriosam error recusandae minima
dignissimos molestias ea quis deleniti fugit explicabo ipsum rer
print(wrap_text(get_lorem_text(2048)))
facilis ea cum impedit nemo quo rerum facere temporibus excepturi exercitationem aut incidunt
provident quos dolore iure quae ipsam placeat similique autem voluptatibus voluptate at quasi eius
sapiente id culpa alias dicta nostrum optio quam aperiam officiis fugit repellat illum nam
voluptates velit minus atque doloribus nobis est tempora debitis sunt dolorum vero odio inventore
harum recusandae distinctio aliquam amet consectetur ullam nisi officia cupiditate suscipit
laboriosam nesciunt nulla minima quisquam hic natus tenetur sed non laudantium ab soluta vitae
explicabo vel quidem molestiae praesentium repudiandae sint reprehenderit dolor beatae ad fuga
expedita quod dolorem mollitia magnam labore omnis laborum odit voluptas earum aliquid et assumenda
perspiciatis saepe ratione corporis iusto totam neque cumque ipsa tempore modi molestias
perferendis animi voluptatem quia nihil maxime consequuntur doloremque accusamus iste magni error
veritatis a dignissimos necessitatibus eveniet dolores maiores unde illo libero quis consequatur
voluptatum veniam adipisci delectus pariatur obcaecati enim corrupti deserunt quas eligendi porro
itaque sequi sit reiciendis rem ducimus ipsum commodi accusantium aspernatur ut qui fugiat ex esse
in asperiores eos quaerat quibusdam blanditiis eaque deleniti possimus architecto numquam eum
repellendus quaerat vel veniam temporibus quam dicta blanditiis beatae qui ea non ut nulla quia hic
est vitae maiores magni eligendi nisi error neque fuga ad ducimus impedit aut amet dolor voluptas
explicabo adipisci dolore delectus eaque necessitatibus pariatur tempore consectetur consequuntur
culpa sequi similique perspiciatis fugiat quisquam nesciunt quis laborum dignissimos voluptates
possimus repellat ratione voluptatibus quidem facere in provident deleniti voluptatum rerum
quibusdam ex ipsa commodi distinctio accusamus dolorem tempora ullam ipsum perferendis deserunt
numquam corporis facilis unde voluptatem totam aliquid maxime excepturi mollitia officiis
asperiores iste laudantium atque odit d

Creating the Chunks¶

In this notebook we will be using two different chunking strategies:

- Recursive character splitting with a fixed chunk size and overlap, preferring paragraph and sentence boundaries.
- Semantic chunking, which splits where the embedding similarity between consecutive sentences drops below a threshold.

To see how different texts get chunked with different strategies and chunk sizes check out the Chunking Visualizer.
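As a mental model for the recursive splitter, here is a minimal pure-Python sketch of fixed-size character windows with overlap (a simplification: the real splitter also prefers paragraph and sentence boundaries):

```python
def window_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Fixed-size character windows with overlap (no separator awareness)."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# a non-repeating toy text so the overlap check below is meaningful
text = "".join(chr(97 + i % 26) for i in range(1000))
chunks = window_chunks(text, 256, 64)
# consecutive chunks share their last/first `overlap` characters
```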

def get_recursive_splitter(chunk_size: int, chunk_overlap: int) -> TextSplitter:
    return RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", r"(?<=\. )", " ", ""],
        # without this flag the lookbehind pattern is escaped and treated literally
        is_separator_regex=True,
        length_function=len,
    )
# the recursive splitter tries newlines first; there are none, so it will fall back to sentence boundaries
sample_df["article"].map(lambda x: x.count("\n")).sum()
0
# if we can make use of any device that is better than the CPU, we will use it
device = "cpu"
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"

model_kwargs = {'device': device, "trust_remote_code": True}
model_kwargs
{'device': 'cuda', 'trust_remote_code': True}
embedding_models = {
    "mini": HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2", model_kwargs=model_kwargs),
    "bge-m3": HuggingFaceEmbeddings(model_name="BAAI/bge-m3", model_kwargs=model_kwargs),
    "gte": HuggingFaceEmbeddings(model_name="Alibaba-NLP/gte-base-en-v1.5", model_kwargs=model_kwargs),
}
recursive_256_splitter = get_recursive_splitter(256, 64)
recursive_1024_splitter = get_recursive_splitter(1024, 128)
semantic_splitter = SemanticChunker(
    embedding_models["mini"], breakpoint_threshold_type="percentile"
)
splitters = {
    "recursive_256": recursive_256_splitter,
    "recursive_1024": recursive_1024_splitter,
    "semantic": semantic_splitter
}
def chunk_documents(df: pd.DataFrame, text_splitter: TextSplitter):
    chunks = []
    chunk_id = 0
    for _, row in tqdm(df.iterrows(), total=len(df)):
        article_content = row['article']
        title = row['title']
        # we add the title to the content as it might be relevant to the question
        full_text = title + ": " + article_content
        char_chunks = text_splitter.split_text(full_text)
        for chunk in char_chunks:
            chunk_id += 1
            # add metadata to the chunk for potential later use
            metadata = {
                'title': row['title'],
                'url': row['url'],
                'domain': row['domain'],
                'id': chunk_id,
            }
            chunks.append(Document(
                page_content=chunk,
                metadata=metadata,
            ))
    return chunks
chunks_folder = silver_folder / "chunks"
if not chunks_folder.exists():
    chunks_folder.mkdir()
def get_or_create_chunks(df: pd.DataFrame, text_splitter: TextSplitter, splitter_name: str) -> List[Document]:
    chunks_file = chunks_folder / f"{splitter_name}_chunks.json"
    if chunks_file.exists():
        with open(chunks_file, "r") as file:
            chunks = [Document(**chunk) for chunk in json.load(file)]
        print(f"Loaded {len(chunks)} chunks from {chunks_file}")
    else:
        chunks = chunk_documents(df, text_splitter)
        with open(chunks_file, "w") as file:
            json.dump([doc.dict() for doc in chunks], file, indent=4)
        print(f"Saved {len(chunks)} chunks to {chunks_file}")
    return chunks
chunks = {}
for splitter_name, splitter in splitters.items():
    chunks[splitter_name] = get_or_create_chunks(sample_df, splitter, splitter_name)
Loaded 25399 chunks from data/silver/chunks/recursive_256_chunks.json
Loaded 5754 chunks from data/silver/chunks/recursive_1024_chunks.json
Loaded 3146 chunks from data/silver/chunks/semantic_chunks.json

Now that we have created and saved the chunks we can analyze them. We can already see above that the semantic chunks are generally larger than the recursive chunks.

Analyzing the Chunks¶

Let's start by looking at the first chunk of the first article to get a feeling for what the chunks look like depending on the chunking strategy and then we will look at the distribution of the chunk sizes and the number of chunks per article.

for splitter_name, splitter_chunks in chunks.items():
    print(f"{splitter_name} chunks:")
    print(wrap_text(splitter_chunks[0].page_content, char_per_line=150))
    print()
recursive_256 chunks:
Leclanché’ s new disruptive battery boosts energy density: Energy storage company Leclanché ( SW.LECN) has designed a new battery cell that uses less
cobalt and boosts energy density by 20%. The company says it is also produced in an environmentally

recursive_1024 chunks:
Leclanché’ s new disruptive battery boosts energy density: Energy storage company Leclanché ( SW.LECN) has designed a new battery cell that uses less
cobalt and boosts energy density by 20%. The company says it is also produced in an environmentally friendly way, making it more recyclable or easy
to dispose of at end-of-life. Leclanché said it has developed an environmentally friendly way to produce lithium-ion ( Li-ion) batteries. It has
replaced highly toxic organic solvents, commonly used in the production process, with a water-based process to make nickel-manganese-cobalt-aluminium
cathodes ( NMCA). Organic solvents, such as N-methyl pyrrolidone ( NMP), are highly toxic and harmful to the environment. The use of NMP has been
restricted by the European Commission, having been added to the list of Substances of Very High Concern, which can have serious irreversible effects
on human health and the environment. Besides being technically simpler, eliminating the use of organic solvents also eliminates the risk

semantic chunks:
Leclanché’ s new disruptive battery boosts energy density: Energy storage company Leclanché ( SW.LECN) has designed a new battery cell that uses less
cobalt and boosts energy density by 20%. The company says it is also produced in an environmentally friendly way, making it more recyclable or easy
to dispose of at end-of-life. Leclanché said it has developed an environmentally friendly way to produce lithium-ion ( Li-ion) batteries. It has
replaced highly toxic organic solvents, commonly used in the production process, with a water-based process to make nickel-manganese-cobalt-aluminium
cathodes ( NMCA). Organic solvents, such as N-methyl pyrrolidone ( NMP), are highly toxic and harmful to the environment. The use of NMP has been
restricted by the European Commission, having been added to the list of Substances of Very High Concern, which can have serious irreversible effects
on human health and the environment. Besides being technically simpler, eliminating the use of organic solvents also eliminates the risk of
explosion, making the production process safer for employees. Leclanché claims to be a global pioneer in the field, having used aqueous binders in
its for over a decade.

def plot_chunk_lengths(chunks: List[Document], title: str):
    sns.histplot([len(chunk.page_content) for chunk in chunks], kde=True)
    plt.title(title)
    plt.xlabel("Chunk length")
    plt.ylabel("Number of chunks")
    median_chunk_len = np.median([len(chunk.page_content) for chunk in chunks])
    mean_chunk_len = np.mean([len(chunk.page_content) for chunk in chunks])
    plt.axvline(median_chunk_len, color='r', linestyle='--', label=f"Median chunk length: {median_chunk_len:.2f}")
    plt.axvline(mean_chunk_len, color='g', linestyle='--', label=f"Mean chunk length: {mean_chunk_len:.2f}")
    plt.legend()
    plt.show()
plot_chunk_lengths(chunks["recursive_256"], "Chunk lengths for recursive 256 splitter")
plot_chunk_lengths(chunks["recursive_1024"], "Chunk lengths for recursive 1024 splitter")
plot_chunk_lengths(chunks["semantic"], "Chunk lengths for semantic splitter")
chunks_per_article = {splitter_name: Counter([chunk.metadata["title"] for chunk in chunks]) for splitter_name, chunks in chunks.items()}
counts = {splitter_name: [count for title, count in chunk_counts.items()] for splitter_name, chunk_counts in chunks_per_article.items()}

sns.histplot(counts, kde=True)
plt.title("Number of chunks per article")
plt.xlabel("Number of chunks")
plt.ylabel("Number of articles")
plt.legend(chunks_per_article.keys())
plt.show()

From our analysis of the created chunks, we can see that the recursive chunks are all close to the defined maximum size, while the semantic chunks vary widely. This is because the semantic chunking strategy follows the semantic boundaries of the article rather than a fixed length.

We can also see that despite the semantic chunks being larger on average, the distribution of the number of chunks per article is much wider for the recursive splitters. Because every recursive chunk has roughly the same size, longer articles produce proportionally more of them, whereas the semantic splitter absorbs length differences into the chunk sizes themselves.
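One way to quantify this difference in spread is the coefficient of variation, i.e. the standard deviation of chunk lengths relative to their mean. A toy sketch with made-up lengths (not taken from the notebook's data):

```python
import statistics

# hypothetical chunk lengths: recursive-style cluster near the size cap,
# semantic-style lengths are spread out
recursive_style = [250, 255, 248, 256, 252]
semantic_style = [120, 900, 300, 1500, 200]

def relative_spread(lengths: list[int]) -> float:
    # coefficient of variation: stdev relative to the mean
    return statistics.pstdev(lengths) / statistics.mean(lengths)
```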

Generating Embeddings¶

Now that we have clean chunks, the next step involves generating embeddings for our article chunks. These embeddings will serve as a crucial component for efficient retrieval within the RAG pipeline. For our vector store we'll utilize ChromaDB, a powerful tool for indexing and searching high-dimensional data. To integrate our chosen embedding models with ChromaDB, we'll define a custom wrapper class. This wrapper class will act as an intermediary, ensuring seamless communication between the models and the ChromaDB indexing system.

class CustomChromadbEmbeddingFunction(EmbeddingFunction):

    def __init__(self, model) -> None:
        super().__init__()
        self.model = model

    def _embed(self, texts):
        return [self.model.embed_query(x) for x in texts]

    def embed_query(self, query):
        return self._embed([query])

    def __call__(self, input: Documents) -> Embeddings:
        embeddings = self._embed(input)
        return embeddings
chroma_embedding_functions = {
    "mini": CustomChromadbEmbeddingFunction(embedding_models["mini"]),
    "bge-m3": CustomChromadbEmbeddingFunction(embedding_models["bge-m3"]),
    "gte": CustomChromadbEmbeddingFunction(embedding_models["gte"]),
}
for name, embedding_function in chroma_embedding_functions.items():
    sample = embedding_function(["Hello, world!"])[0][:5]
    print(f"{name} embedding sample: {sample}")
mini embedding sample: [0.034922659397125244, 0.01883005164563656, -0.017854738980531693, 0.00013884028885513544, 0.0740736573934555]
bge-m3 embedding sample: [-0.016155630350112915, 0.02699342556297779, -0.04258322715759277, 0.013542207889258862, -0.019354630261659622]
gte embedding sample: [0.03789481893181801, 0.3469243049621582, -0.2047133892774582, -0.21238623559474945, -0.49100759625434875]

Generating embeddings can be a computationally intensive process. To optimize efficiency and avoid redundant computations, we'll leverage checkpointing. This technique involves storing the generated embeddings along with their corresponding article chunks. We'll define a simple class to encapsulate this data, facilitating efficient retrieval and reducing the need for recalculating embeddings unless absolutely necessary.
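The get-or-create checkpointing pattern used throughout the notebook can be boiled down to a small sketch (a hypothetical helper, shown here with a temporary file rather than the notebook's checkpoint folders):

```python
import json
import tempfile
from pathlib import Path

def get_or_create(path: Path, create):
    """Load a cached JSON artifact if it exists, otherwise build and save it."""
    if path.exists():
        with open(path) as f:
            return json.load(f)
    result = create()
    with open(path, "w") as f:
        json.dump(result, f)
    return result

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "embeddings.json"
    first = get_or_create(cache, lambda: [0.1, 0.2])   # computed and saved
    second = get_or_create(cache, lambda: [9.9, 9.9])  # loaded; create() is skipped
```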

embeddings_folder = silver_folder / "embeddings"
if not embeddings_folder.exists():
    embeddings_folder.mkdir()
class DocumentEmbedding():
    def __init__(self, document: Document, text_embedding: List[float]) -> None:
        self.document = document
        self.text_embedding = text_embedding
    
    def to_dict(self) -> Dict:
        return {
            "document": self.document.dict(),
            "text_embedding": self.text_embedding
        }
    
    @classmethod
    def from_dict(cls, d: Dict) -> "DocumentEmbedding":
        return cls(
            document=Document(**d["document"]),
            text_embedding=d["text_embedding"]
        )


def get_or_create_embeddings(
        embedding_function: EmbeddingFunction,
        chunks: List[Document],
        embedding_name: str,
) -> List[DocumentEmbedding]:
    embeddings_file = embeddings_folder / f"{embedding_name}_embeddings.json"
    if embeddings_file.exists():
        with open(embeddings_file, "r") as file:
            embeddings = [DocumentEmbedding.from_dict(embedding) for embedding in json.load(file)]
        print(f"Loaded {len(embeddings)} embeddings from {embeddings_file}")
    else:
        embeddings = []
        for chunk in tqdm(chunks):
            text_embedding = embedding_function([chunk.page_content])[0]
            embedding = DocumentEmbedding(
                document=chunk,
                text_embedding=text_embedding
            )
            embeddings.append(embedding)
        with open(embeddings_file, "w") as file:
            json.dump([embedding.to_dict() for embedding in embeddings], file, indent=4)
        print(f"Saved {len(embeddings)} embeddings to {embeddings_file}")
    return embeddings
embeddings = {}
for embedding_name, embedding_function in chroma_embedding_functions.items():
    for splitter_name, splitter_chunks in chunks.items():
        embeddings[f"{embedding_name}_{splitter_name}"] = get_or_create_embeddings(
            embedding_function, splitter_chunks, f"{embedding_name}_{splitter_name}"
        )
Loaded 25399 embeddings from data/silver/embeddings/mini_recursive_256_embeddings.json
Loaded 5754 embeddings from data/silver/embeddings/mini_recursive_1024_embeddings.json
Loaded 3146 embeddings from data/silver/embeddings/mini_semantic_embeddings.json
Loaded 25399 embeddings from data/silver/embeddings/bge-m3_recursive_256_embeddings.json
Loaded 5754 embeddings from data/silver/embeddings/bge-m3_recursive_1024_embeddings.json
Loaded 3146 embeddings from data/silver/embeddings/bge-m3_semantic_embeddings.json
Loaded 25399 embeddings from data/silver/embeddings/gte_recursive_256_embeddings.json
Loaded 5754 embeddings from data/silver/embeddings/gte_recursive_1024_embeddings.json
Loaded 3146 embeddings from data/silver/embeddings/gte_semantic_embeddings.json

Storing the Embeddings in ChromaDB¶

As mentioned above for our semantic search retrieval we will be storing the embeddings in ChromaDB. ChromaDB is a powerful tool for indexing and searching high-dimensional data. It is based on the Hierarchical Navigable Small World (HNSW) algorithm, which is known for its efficiency in searching high-dimensional spaces.

Just like with normal SQL databases there is a storage backend, in this case SQLite, that we connect to with a client. We will use the client to create a separate collection for each set of embeddings, which can be thought of as its own index or vector space. ChromaDB calls these vector spaces "collections". These collections will then be used to search for the chunks most relevant to a user query.
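ChromaDB's cosine distance, configured below via the `hnsw:space` metadata, is 1 minus the cosine similarity between two vectors. A plain numpy sketch of the metric that the HNSW index approximates at scale:

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for parallel vectors, 1 for orthogonal ones."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    similarity = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - float(similarity)

parallel = cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # same direction -> 0.0
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])          # unrelated -> 1.0
```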

semantic search

gold_folder = data_folder / "gold"
if not gold_folder.exists():
    gold_folder.mkdir()
chromadb_folder = gold_folder / "chromadb"
if not chromadb_folder.exists():
    chromadb_folder.mkdir()

chroma_client = chromadb.PersistentClient(path=chromadb_folder.as_posix())
def get_or_create_collection(
        name: str,
        embedding_function: EmbeddingFunction,
        embeddings: List[DocumentEmbedding],
        batch_size: int = 128
) -> Collection:

    collection = chroma_client.get_or_create_collection(
        name=name,
        # configure to use cosine distance not default L2
        metadata={"hnsw:space": "cosine"},
        embedding_function=embedding_function
    )

    if collection.count() == 0:
        for i in tqdm(range(0, len(embeddings), batch_size)):
            batch = embeddings[i:i+batch_size]
            collection.add(
                documents=[embedding.document.page_content for embedding in batch],
                embeddings=[embedding.text_embedding for embedding in batch],
                ids=[str(embedding.document.metadata["id"]) for embedding in batch],
                metadatas=[embedding.document.metadata for embedding in batch]
            )

    return collection
collections = {}
for collection_name, current_embeddings in embeddings.items():
    collection = get_or_create_collection(
        collection_name,
        chroma_embedding_functions[collection_name.split("_")[0]],
        current_embeddings
    )
    collections[collection_name] = collection
    print(f"Collection {collection_name} has {collection.count()} documents")
Collection mini_recursive_256 has 25399 documents
Collection mini_recursive_1024 has 5754 documents
Collection mini_semantic has 3146 documents
Collection bge-m3_recursive_256 has 25399 documents
Collection bge-m3_recursive_1024 has 5754 documents
Collection bge-m3_semantic has 3146 documents
Collection gte_recursive_256 has 25399 documents
Collection gte_recursive_1024 has 5754 documents
Collection gte_semantic has 3146 documents

Once we have stored all the embeddings in ChromaDB, we can test the retrieval process by querying one of our collections. Try some different queries and check whether the most similar chunks make sense.

selected_collection = collections["gte_recursive_1024"]
results = selected_collection.query(
    query_texts=["Climate Change"],
    n_results=3,
)
for doc in results["documents"][0]:
    print(wrap_text(doc))
    print()
Report of the Intergovernmental Panel on Climate Change ( IPCC) makes for grim reading. It warns
that the world is heading for calamitous temperature rises and points to the need for economies to
decarbonise. The UK has set firm and ambitious targets and a pathway to net zero and CCUS will be
one of the tools which is used to achieve this.

scenario used in the study is unlikely because of global efforts to limit greenhouse gas emissions,
the findings reveal a previously unknown tipping point that if activated would release an important
brake on global warming, the authors said. `` We need to think about these worst-case scenarios to
understand how our CO2 emissions might affect the oceans not just this century, but next century
and the following century, '' said Megumi Chikamoto, who led the research as a research fellow at
the University of Texas Institute for Geophysics. The study was published in the journal
Geophysical Research Letters. Today, the oceans soak up about a third of the CO2 emissions
generated by humans. Climate simulations had previously shown that the oceans slow their absorption
of CO2 over time, but none had considered alkalinity as explanation. To reach their conclusion, the
researchers recalculated pieces of a 450-year simulation until they hit on alkalinity as a key
cause of the slowing. According to the findings, the

Potential Climatic Impact of Nord Stream Methane Leaks:  Nord Stream 1 and 2, two subsea pipelines
that transport natural gas from Russia to Germany, were both intentionally destroyed on September
26th, 2022. Enormous amounts of gases, mainly methane, were discharged into the ocean and
eventually into the atmosphere. Methane escaping from sabotaged pipelines in the Baltic Sea (
September 27th, 2022). Image Credit: Danish Armed Forces Methane is the second most prevalent
anthropogenic greenhouse gas after CO2, although its greenhouse effect is substantially stronger.
As a result, whether this catastrophe may have detrimental climatic consequences is a major issue
around the world. This problem was discussed in a news article published in Nature, but no
quantitative implications were reached. Recently, scientists from the Chinese Academy of Sciences’
Institute of Atmospheric Physics approximated the potential climatic effect of leaked methane using
the energy-conservation framework of the Intergovernmental

Analyzing the Embedding Space¶

To gain a better understanding of how the retrieval process works, we will analyze the embedding space. We start by projecting the embeddings into a 2D space using UMAP, a dimensionality reduction technique that is particularly well suited to visualizing high-dimensional data. We will then use the projected embeddings to create a scatter plot of the chunks.
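UMAP itself requires the umap-learn package and is nonlinear; as a lightweight linear stand-in that illustrates the same fit-then-project pattern, here is PCA via SVD in plain numpy (with random vectors standing in for real chunk embeddings):

```python
import numpy as np

def fit_pca_2d(vectors: np.ndarray):
    """Fit a linear 2D projection: mean plus the top two principal components."""
    mean = vectors.mean(axis=0)
    _, _, vt = np.linalg.svd(vectors - mean, full_matrices=False)
    return mean, vt[:2]

def project_2d(vectors, mean, components):
    return (np.asarray(vectors) - mean) @ components.T

rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(500, 768))  # stand-in for chunk embeddings
mean, components = fit_pca_2d(fake_embeddings)
projected = project_2d(fake_embeddings, mean, components)
```

Like UMAP's `fit`/`transform`, the fitted `mean` and `components` can later project new points, such as a query embedding, into the same 2D space.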

def get_vectors_from_collection(collection: Collection):
    stored_chunks = collection.get(include=["documents", "metadatas", "embeddings"])
    return np.array(stored_chunks["embeddings"])

def get_vectors_by_domain(collection: Collection, domain: str):
    stored_chunks = collection.get(include=["documents", "metadatas", "embeddings"])
    metadatas = stored_chunks["metadatas"]
    indices = [str(metadata["id"]) for metadata in metadatas if metadata["domain"] == domain]
    return collection.get(include=["embeddings"], ids=indices)["embeddings"]

def fit_umap(vectors: np.ndarray):
    return umap.UMAP().fit(vectors)

def project_embeddings(embeddings, umap_transform):
    return umap_transform.transform(embeddings)
vectors = get_vectors_from_collection(selected_collection)
print(f"Original shape: {vectors.shape}")
umap_transform = fit_umap(vectors)
vectors_projections = project_embeddings(vectors, umap_transform)
print(f"Projected shape: {vectors_projections.shape}")
Original shape: (5754, 768)
Projected shape: (5754, 2)

You can zoom into the plot by clicking and dragging a box around the area you want to inspect. Double-click the plot to reset the view.

fig = px.scatter(x=vectors_projections[:, 0], y=vectors_projections[:, 1])
fig.show()

Next we will color the embeddings by the domain of the article to see if there are any patterns or clusters in the embedding space based on the domain.

fig = go.Figure()
for domain in sample_df["domain"].unique():
    domain_vectors = get_vectors_by_domain(selected_collection, domain)
    domain_projections = project_embeddings(domain_vectors, umap_transform)
    fig.add_trace(go.Scatter(x=domain_projections[:, 0], y=domain_projections[:, 1], mode='markers', marker=dict(size=4), name=domain))

fig.show()

We can also visualize the retrieval process by plotting the query and the most similar chunks in the embedding space. This gives us a better understanding of how the most similar chunks are found. Keep in mind that the embeddings live in a high-dimensional space and we are only visualizing a 2D projection of them, so the distances between points may be misleading. Try some different queries and see which chunks are retrieved.

def plot_retrieval_results(
        query: str,
        selected_collection: Collection,
        n_results: int = 5
):
    vectors = get_vectors_from_collection(selected_collection)
    umap_transform = fit_umap(vectors)
    vectors_projections = project_embeddings(vectors, umap_transform)

    query_embedding = selected_collection._embedding_function([query])[0]
    query_embedding = np.array(query_embedding).reshape(1, -1)
    query_projection = project_embeddings(query_embedding, umap_transform)

    nearest_neighbors = selected_collection.query(
        query_texts=[query],
        n_results=n_results,
    )
    neighbor_vectors = selected_collection.get(include=["embeddings"], ids=nearest_neighbors["ids"][0])["embeddings"]
    neighbor_projections = project_embeddings(neighbor_vectors, umap_transform)
   

    fig = go.Figure()

    fig.add_trace(go.Scatter(x=vectors_projections[:, 0], y=vectors_projections[:, 1], mode='markers', marker=dict(size=5), name="other vectors"))
    fig.add_trace(go.Scatter(x=neighbor_projections[:, 0], y=neighbor_projections[:, 1], mode='markers', marker=dict(size=5, color='orange'), name="nearest neighbors"))
    fig.add_trace(go.Scatter(x=query_projection[:, 0], y=query_projection[:, 1], mode='markers', marker=dict(size=10, color='red', symbol='x'), name="query"))

    fig.show()
plot_retrieval_results(
    "Climate Change",
    selected_collection,
)

Lastly, we will analyze the distribution of the cosine distances between the query and the different chunks. This will deepen our understanding of the cosine distance and show that distances in the high-dimensional space are not the same as in the 2D projection. Do not confuse cosine distance with cosine similarity: the cosine similarity is the cosine of the angle between two vectors, while the cosine distance is 1 minus the cosine similarity, so smaller values mean the vectors are more similar.

def cosine_distance(vector1, vector2):
    dot_product = np.dot(vector1, vector2.T)
    norm_product = np.linalg.norm(vector1) * np.linalg.norm(vector2)
    similarity = dot_product / norm_product
    return 1 - similarity
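As a quick sanity check, the cosine distance behaves as expected on toy vectors. A minimal sketch (re-defining the helper so it runs standalone):

```python
import numpy as np

def cosine_distance(vector1, vector2):
    # 1 - cosine similarity: 0 for identical directions, 1 for orthogonal,
    # 2 for opposite directions
    dot_product = np.dot(vector1, vector2.T)
    norm_product = np.linalg.norm(vector1) * np.linalg.norm(vector2)
    return 1 - dot_product / norm_product

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
print(cosine_distance(a, a))   # 0.0 (identical direction)
print(cosine_distance(a, b))   # 1.0 (orthogonal)
print(cosine_distance(a, -a))  # 2.0 (opposite direction)
```

Note that cosine distance ignores vector magnitude: `cosine_distance(a, 3 * a)` is still 0, which is why it is a popular choice for comparing embeddings.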

def plot_cosine_distances(
        query: str,
        selected_collection: Collection
):
    vectors = get_vectors_from_collection(selected_collection)
    umap_transform = fit_umap(vectors)
    vectors_projections = project_embeddings(vectors, umap_transform)

    query_embedding = selected_collection._embedding_function([query])[0]
    query_embedding = np.array(query_embedding).reshape(1, -1)
    query_projection = project_embeddings(query_embedding, umap_transform)

    similarities = np.array([cosine_distance(query_embedding, vector) for vector in vectors])

    fig = go.Figure()
    fig.add_trace(go.Scatter(
        x=vectors_projections[:, 0],
        y=vectors_projections[:, 1],
        mode='markers',
        marker=dict(
            size=5,
            color=similarities.flatten(),
            colorscale='RdBu',
            colorbar=dict(title='Cosine Distance')
        ),
        text=['Cosine Distance: {:.4f}'.format(
            sim) for sim in similarities.flatten()],
        name='Other Vectors'
    ))

    fig.add_trace(go.Scatter(x=[query_projection[0][0]], y=[query_projection[0][1]], mode='markers', marker=dict(size=10, color='black', symbol='x'), text=['Query Vector'], name='Query Vector'))

    fig.show()
plot_cosine_distances(
    "Climate Change",
    selected_collection,
)

Putting it all TogetherΒΆ

Now that we have generated the embeddings and stored them in ChromaDB, we can put it all together and create the RAG pipeline: retrieve the chunks most similar to the query, format them as context, and pass them together with the question to the LLM to generate an answer.

How does Langchain work?ΒΆ

In this notebook we will be using Langchain to build up our pipeline. You do not need a library like Langchain or LlamaIndex to build a RAG pipeline, but it can make the process easier.

The idea behind Langchain and its LCEL (LangChain Expression Language) is simple: a pipeline is made up of many steps, each taking an input and producing an output, and these steps can be chained together. LCEL is a concise syntax for defining the steps and how they are connected. For more technical details on how Langchain works, check out the Langchain Documentation.

In simple terms, Langchain provides an abstraction of a step with an invoke method that takes an input (typically a dictionary of parameters) and returns an output (also typically a dictionary). This lets you chain different steps together, define how they are connected, and split chains of steps off into separate pipelines.
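The chaining idea can be illustrated with a tiny, hypothetical Step class (a toy stand-in, not Langchain's actual implementation): every step exposes invoke, and the | operator composes steps into a pipeline.

```python
class Step:
    """Toy stand-in for a Langchain Runnable: invoke() plus | composition."""

    def __init__(self, fn):
        self.fn = fn

    def invoke(self, x):
        return self.fn(x)

    def __or__(self, other):
        # Chaining: feed this step's output into the next step's input
        return Step(lambda x: other.invoke(self.invoke(x)))

# A mock "retrieve -> build prompt" chain passing dictionaries between steps
retrieve = Step(lambda q: {"question": q, "context": "retrieved chunks about " + q})
build_prompt = Step(lambda d: f"Question: {d['question']}\nContext: {d['context']}")

pipeline = retrieve | build_prompt
print(pipeline.invoke("solar power"))
```

In real Langchain code the same `|` operator composes prompts, models, and output parsers, as in the `rag_prompt | llm | StrOutputParser()` chain used below.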

Below you can see an overview of our RAG pipeline:

rag_pipeline

def create_qa_chain(retriever: BaseRetriever):
    template = """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. \
    If you don't know the answer, just say that you don't know. Keep the answer concise.

    Question: {question}
    Context: {context}
    Answer:
    """
    rag_prompt = ChatPromptTemplate.from_template(template)

    def format_docs(docs):
        return "\n\n".join(doc.page_content for doc in docs)

    rag_chain = RunnableParallel(
        {
            "context": retriever,
            "question": RunnablePassthrough()
        }
    ).assign(answer=(
         RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
            | rag_prompt
            | llm
            | StrOutputParser()
    ))

    return rag_chain

For Langchain to work with our ChromaDB collections, we need to wrap them in a format Langchain can work with: so-called stores and retrievers.

def collection_to_store(collection_name: str, lc_embedding_model: EmbeddingFunction):
    return Chroma(
        client=chroma_client,
        collection_name=collection_name,
        embedding_function=lc_embedding_model,
    )

def store_to_retriever(store: VectorStore, k: int = 3):
    retriever = store.as_retriever(
        search_type="similarity", search_kwargs={'k': k}
    )
    return retriever
selected_store = collection_to_store("gte_recursive_1024", embedding_models["gte"])
selected_retriever = store_to_retriever(selected_store)
selected_retriever.invoke("Climate Change")
[Document(page_content='Report of the Intergovernmental Panel on Climate Change ( IPCC) makes for grim reading. It warns that the world is heading for calamitous temperature rises and points to the need for economies to decarbonise. The UK has set firm and ambitious targets and a pathway to net zero and CCUS will be one of the tools which is used to achieve this.', metadata={'domain': 'energyvoice', 'id': 3173, 'title': 'The 10 Point Pod delves deep into the heart of CCUS', 'url': 'energyvoice.com/promoted/347021/the-10-point-pod-delves-deep-into-the-heart-of-ccus'}),
 Document(page_content="scenario used in the study is unlikely because of global efforts to limit greenhouse gas emissions, the findings reveal a previously unknown tipping point that if activated would release an important brake on global warming, the authors said. `` We need to think about these worst-case scenarios to understand how our CO2 emissions might affect the oceans not just this century, but next century and the following century, '' said Megumi Chikamoto, who led the research as a research fellow at the University of Texas Institute for Geophysics. The study was published in the journal Geophysical Research Letters. Today, the oceans soak up about a third of the CO2 emissions generated by humans. Climate simulations had previously shown that the oceans slow their absorption of CO2 over time, but none had considered alkalinity as explanation. To reach their conclusion, the researchers recalculated pieces of a 450-year simulation until they hit on alkalinity as a key cause of the slowing. According to the findings, the", metadata={'domain': 'azocleantech', 'id': 633, 'title': 'Global Warming Could Trigger Chemical Changes in the Ocean Surface that Accelerate Climate Change', 'url': 'azocleantech.com/news.aspx?newsID=33053'}),
 Document(page_content='Potential Climatic Impact of Nord Stream Methane Leaks:  Nord Stream 1 and 2, two subsea pipelines that transport natural gas from Russia to Germany, were both intentionally destroyed on September 26th, 2022. Enormous amounts of gases, mainly methane, were discharged into the ocean and eventually into the atmosphere. Methane escaping from sabotaged pipelines in the Baltic Sea ( September 27th, 2022). Image Credit: Danish Armed Forces Methane is the second most prevalent anthropogenic greenhouse gas after CO2, although its greenhouse effect is substantially stronger. As a result, whether this catastrophe may have detrimental climatic consequences is a major issue around the world. This problem was discussed in a news article published in Nature, but no quantitative implications were reached. Recently, scientists from the Chinese Academy of Sciences’ Institute of Atmospheric Physics approximated the potential climatic effect of leaked methane using the energy-conservation framework of the Intergovernmental', metadata={'domain': 'azocleantech', 'id': 614, 'title': 'Potential Climatic Impact of Nord Stream Methane Leaks', 'url': 'azocleantech.com/news.aspx?newsID=32568'})]

Now that we have our retriever we can create our RAG pipeline. Try some different queries and see how the pipeline responds.

selected_chain = create_qa_chain(selected_retriever)
selected_chain.invoke("Where are the biggest increases in wildfire smoke exposure in recent years?")
{'context': [Document(page_content='Blue River, Vida, Phoenix, and Talentβ€”were lost to the so-called Labor Day Fires in 2020. And in 2021, the Lytton Creek Fire wiped out the village of Lytton, British Columbia, destroying hundreds of homes. All told, between 2017 and 2021, nearly 120,000 fires burned across western North America, burning nearly 39 million acres of land and claiming more than 60,000 structures. The impacts of wildfires reach well beyond the people, communities, and ecosystems that are directly affected by flames. Wildfires have consequences for public health, water supplies, and economies long after a fire is extinguished. Mounting research is showing exposure to the fine particulate matter in wildfire smoke is responsible for thousands of indirect deaths, increases to the risk of pre-term birth among pregnant women, and even an increase in the risk of COVID-19 illness and death. Surprisingly, some of the biggest increases in wildfire smoke exposure in recent years are in the Great Plains region, from North Dakota to Texas. In', metadata={'domain': 'cleantechnica', 'id': 57, 'title': 'What Does A β€œ Normal ” Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'}),
  Document(page_content='the biggest increases in wildfire smoke exposure in recent years are in the Great Plains region, from North Dakota to Texas. In addition to their impact on air quality, wildfires can disrupt processes that maintain access to drinking water, including by reducing the ability of soil to absorb water when it rains and sending additional sediment into drinking water systems. The science is clear that climate change is increasing what’ s known as the β€œ vapor pressure deficit, ” or VPD, across western North America. When VPD is high, the atmosphere can pull more water out of plants, which dries them out and makes them more likely to burn. VPD also is also a good metric for drought, including the long 21st century drought the region has been experiencing. But climate isn’ t the only factor behind the west’ s worsening wildfires. More than a century of aggressive fire suppression and an even longer period of settler colonial repression of Indigenous burning practices have led to forests that are too dense, too', metadata={'domain': 'cleantechnica', 'id': 58, 'title': 'What Does A β€œ Normal ” Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'}),
  Document(page_content='even longer period of settler colonial repression of Indigenous burning practices have led to forests that are too dense, too uniform in their species, and without the resistance to fire they once had. Moreover, a lack of affordable housing across the region and the desire for proximity to beautiful, natural places has led to large increases in the number of people living in wildfire-prone areas. Recent research has found that human activities were responsible for starting more than 80% of all wildfires in the United States, while also increasing the length of the fire season. This year’ s wildfire season may offer the western US a chance to catch its breath after several years of record-breaking fires. But with climate change expected to deepen the hot, dry conditions that enable such record-breaking fires, we must be preparing for a future with even more fire. Carly Phillips contributed to this post. We publish a number of guest posts from experts in a large variety of fields. This is our contributor', metadata={'domain': 'cleantechnica', 'id': 59, 'title': 'What Does A β€œ Normal ” Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'})],
 'question': 'Where are the biggest increases in wildfire smoke exposure in recent years?',
 'answer': 'The biggest increases in wildfire smoke exposure in recent years are in the Great Plains region, from North Dakota to Texas.'}
chains = {}
for collection_name, collection in collections.items():
    store = collection_to_store(collection_name, embedding_models[collection_name.split("_")[0]])
    retriever = store_to_retriever(store)
    chain = create_qa_chain(retriever)
    chains[collection_name] = chain

chains.keys()
dict_keys(['mini_recursive_256', 'mini_recursive_1024', 'mini_semantic', 'bge-m3_recursive_256', 'bge-m3_recursive_1024', 'bge-m3_semantic', 'gte_recursive_256', 'gte_recursive_1024', 'gte_semantic'])

EvaluationΒΆ

Because we have many hyperparameters (such as chunk size and prompts) to tune and different strategies to try, we will use the RAGAS (RAG Assessment) framework to evaluate our pipeline. RAGAS lets you evaluate a RAG pipeline using an LLM as a judge, alongside other metrics that utilize embedding models. We will go into more detail on the metrics later on.

Before we can start the evaluation, we need to define the evaluation questions and their ground-truth answers. For this we will use the provided evaluation questions. To increase our question pool, we will also generate additional question-answer pairs by sampling a random chunk and using the LLM to generate both the question and the answer.

human_eval_df.head()
question relevant_section url
example_id
1 What is the innovation behind LeclanchΓ©'s new ... LeclanchΓ© said it has developed an environment... sgvoice.energyvoice.com/strategy/technology/23...
2 What is the EU’s Green Deal Industrial Plan? The Green Deal Industrial Plan is a bid by the... sgvoice.energyvoice.com/policy/25396/eu-seeks-...
3 What is the EU’s Green Deal Industrial Plan? The European counterpart to the US Inflation R... pv-magazine.com/2023/02/02/european-commission...
4 What are the four focus areas of the EU's Gree... The new plan is fundamentally focused on four ... sgvoice.energyvoice.com/policy/25396/eu-seeks-...
5 When did the cooperation between GM and Honda ... What caught our eye was a new hookup between G... cleantechnica.com/2023/05/08/general-motors-se...
def generate_eval_answers(df: pd.DataFrame) -> pd.DataFrame:
    answer_generation_prompt = """Answer the following question based on the article:
    Question: {question}
    Article: {article}
    """
    answer_generation_chain = ChatPromptTemplate.from_template(answer_generation_prompt) | llm
    for i, row in tqdm(df.iterrows(), total=len(df)):
        df.at[i, "ground_truth"] = answer_generation_chain.invoke({"question": row["question"], "article": row["relevant_section"]}).content
    return df
if (silver_folder / "human_eval.csv").exists():
    human_eval_df = pd.read_csv(silver_folder / "human_eval.csv")
else:
    human_eval_df = generate_eval_answers(human_eval_df)
    human_eval_df.to_csv(silver_folder / "human_eval.csv", index=False)

human_eval_df.head()
question relevant_section url ground_truth
0 What is the innovation behind LeclanchΓ©'s new ... LeclanchΓ© said it has developed an environment... sgvoice.energyvoice.com/strategy/technology/23... The innovation behind LeclanchΓ©'s new method t...
1 What is the EU’s Green Deal Industrial Plan? The Green Deal Industrial Plan is a bid by the... sgvoice.energyvoice.com/policy/25396/eu-seeks-... The EU’s Green Deal Industrial Plan is an init...
2 What is the EU’s Green Deal Industrial Plan? The European counterpart to the US Inflation R... pv-magazine.com/2023/02/02/european-commission... The EU’s Green Deal Industrial Plan is aimed a...
3 What are the four focus areas of the EU's Gree... The new plan is fundamentally focused on four ... sgvoice.energyvoice.com/policy/25396/eu-seeks-... The four focus areas of the EU's Green Deal In...
4 When did the cooperation between GM and Honda ... What caught our eye was a new hookup between G... cleantechnica.com/2023/05/08/general-motors-se... The cooperation between GM and Honda on fuel c...
def generate_synthetic_qa_pairs(documents: List[Document], n: int = 10) -> pd.DataFrame:
    synthetic_questions = []
    documents = np.random.choice(documents, n)

    question_generation_prompt = """Generate a short and general question based on the following news article:
    Article: {article}
    """
    question_generation_chain = ChatPromptTemplate.from_template(question_generation_prompt) | llm

    answer_generation_prompt = """Answer the following question based on the article:
    Question: {question}
    Article: {article}
    """
    answer_generation_chain = ChatPromptTemplate.from_template(answer_generation_prompt) | llm


    for document in tqdm(documents):
        element = {}
        content = document.page_content
        element["relevant_section"] = content
        element["url"] = document.metadata["url"]
        question = question_generation_chain.invoke({"article": content}).content
        element["question"] = question
        answer = answer_generation_chain.invoke({"question": question, "article": content}).content
        element["ground_truth"] = answer
        synthetic_questions.append(element)

    return pd.DataFrame(synthetic_questions)
if not (silver_folder / "synthetic_eval.csv").exists():
    synthetic_eval_df = generate_synthetic_qa_pairs(chunks["recursive_1024"], 25)
    synthetic_eval_df.to_csv(silver_folder / "synthetic_eval.csv", index=False)
else:
    synthetic_eval_df = pd.read_csv(silver_folder / "synthetic_eval.csv", index_col=0)
synthetic_eval_df.head()
url question ground_truth
relevant_section
Climate Shifts Forcefully Against Big Oil: The relationship between Big Oil and society is fundamentally changing. Public companies on both sides of the Atlantic are coming under a level of pressure to decarbonize their operations that was unthinkable just a year or two ago ( PIW Aug.7'20). This pressure is being wielded by investors as well as by court systems in some jurisdictions. The impact to corporate strategies could be enormous if companies feel they must respond to the heat by unwinding oil and gas operations earlier than planned. In one pivotal day this week, Exxon Mobil saw the tiny Engine No. 1 hedge fund unseat two -- and possibly three -- of its directors by harnessing the voting power of major pension and index funds. Chevron became the latest US company asked to set Scope 3 emissions reduction targets, following similar votes at ConocoPhillips and Phillips 66. And Royal Dutch Shell lost a Dutch court case that could force it to slash emissions by 45% by 2030, and redefine the obligations of energyintel.com/0000017b-a7dd-de4c-a17b-e7df5a... How is the relationship between Big Oil compan... Big Oil companies are facing increasing pressu...
Government consults on changes to supply chain plans and CfD delivery: The UK government is consulting on changes to supply chain plan and Contracts for Difference ( CfD) policy in preparation for its fifth auction round. Launched February 4, the consultation by the Department for Business, Energy and Industrial Strategy ( BEIS) aims to garner industry input to make the CfD process β€œ more adaptable and forward looking. ” BEIS is inviting view views on the questions and pass threshold for the supply chain plan ( SCP) questionnaire, including the mooted introduction of interviews as part of the process; extending supply chain policy to support emerging technologies, starting with floating offshore wind projects; strengthening its disincentives for non-delivery; and amending Regulation 51 ( 10) ( c) of the CfD regulations which govern proposed project commissioning dates. Currently, developers aiming to build projects of 300MW or more must apply for an SCP statement from the Secretary of State for BEIS to take energyvoice.com/renewables-energy-transition/3... What changes are being proposed by the UK gove... The UK government is proposing changes to supp...
volumes at the plant were now averaging 140 metric tons per day versus a 150 metric ton/d target. A Tepco spokesperson told Energy Intelligence Feb. 12 that relatively lower amounts of heavy rainfall in typhoons last fall helped ease pressure, but confirmed that `` it is still necessary to systematically construct necessary facilities, ” such as more tanks, and β€œ effectively use the entire site. '' `` They have been working very hard to successfully reduce the inflow of groundwater and rainwater that enter the building basements, becoming contaminated, requiring processing, and eventual additional tank storage, '' Lake Barrett, a senior adviser to Tepco, told Energy Intelligence. He added that this is a relatively dry season so short-term numbers may be misleading but that the `` trend is solidly downward and they are continuing actions to even further reduce in-leakage. '' This is giving `` the government more flexibility to find the least bad time to decide something, '' but a decision on water energyintel.com/0000017b-a7dc-de4c-a17b-e7de9e... What measures are being taken by Tepco to redu... Tepco is taking measures such as constructing ...
rules. β€œ I think that BOEM is really looking to the industry to help with the development of these regulations, ” De Cort told an audience at OTC. β€œ We’ ve had some unofficial calls for where people want to look for these things. ” Exxon Mobil made waves last year in oil and gas Lease Sale 257 when it bid on more than 90 shallow-water blocks that sources say could be the β€œ sweet spot ” in the Gulf for injecting CO2 because of their geological characteristics. Unfortunately for Exxon, which is planning a $ 100 billion CCS project in the Houston Ship Channel, that auction was annulled by a federal court, meaning the company will likely not get to enjoy any first-mover advantage it had in attempting to secure that acreage. energyintel.com/00000180-9b95-d98b-adb6-ff9590... What role is the oil and gas industry playing ... The oil and gas industry, such as Exxon Mobil,...
Heat Pumps – Page 2 – pv magazine International: The Gothenburg district court in Sweden has charged eight people for allegedly stealing nearly $ 416,000 of air source heat pumps, geothermal heat pumps, white goods and tools from multiple locations in the western part of the country. Samsung and SMA are using a new cloud-to-cloud system that allows PV systems with SMA inverters to be integrated with Samsung heat pumps. Toshiba Carrier has been recognized at the 2023 National Invention Awards of Japan for its innovative discharge port structure in multi-cylinder rotary piston compressors for heat pumps. The technology tackles the problem of overheating, resulting in improved heating capacity and efficiency. More than 20 companies, governments, and nongovernmental organizations have presented EU Energy Commissioner Kadri Simson with a roadmap for the European heat pump sector, including recommended solutions to overcome barriers to growth. Germany’ s MAN Energy Solutions has supplied two 50 MW seawater heat pv-magazine.com/category/heat-pumps/page/2 What innovative solutions are being developed ... One innovative solution being developed in the...
question_length = {
    "human": human_eval_df["question"].map(len),
    "synthetic": synthetic_eval_df["question"].map(len)
}

sns.histplot(question_length, kde=True)
plt.title("Question Length Distribution")
plt.xlabel("Question Length")
plt.ylabel("Count")
plt.show()
eval_df = pd.concat([human_eval_df, synthetic_eval_df], ignore_index=True)
eval_df["is_synthetic"] = eval_df["relevant_section"].isna()
eval_df["is_synthetic"].value_counts()
is_synthetic
True     25
False    23
Name: count, dtype: int64

Now we have roughly doubled the number of questions and answers. However, our synthetic questions are slightly longer than the provided questions, which could mean that they are slightly easier to answer. This potential bias should be taken into account when evaluating the pipeline.

RAGAS MetricsΒΆ

RAGAS provides a variety of metrics to evaluate the performance of a RAG pipeline. We will use four of them: faithfulness (is the generated answer grounded in the retrieved context?), answer relevancy (does the answer actually address the question?), context relevancy (is the retrieved context relevant to the question?), and answer correctness (does the answer agree with the ground truth?).

For this to work we create a test dataset for each of our RAG pipelines that contains the evaluation questions and their ground-truth answers. We then run all the questions through the pipeline and store the generated answers and the retrieved chunks. This test dataset is what RAGAS uses to calculate the metrics.

datasets_folder = gold_folder / "datasets"
if not datasets_folder.exists():
    datasets_folder.mkdir()

def get_or_create_eval_dataset(name: str, df: pd.DataFrame, chain: Chain) -> Dataset:
    dataset_file = datasets_folder / f"{name}_dataset.json"
    if dataset_file.exists():
        with open(dataset_file, "r") as file:
            dataset = Dataset.from_dict(json.load(file))
        print(f"Loaded {name} dataset from {dataset_file}")
    else:
        datapoints = {
            "question": df["question"].tolist(),
            "answer": [],
            "contexts": [],
            "ground_truth": df["ground_truth"].tolist(),
            "context_urls": []
        }
        for question in tqdm(datapoints["question"]):
            result = chain.invoke(question)
            datapoints["answer"].append(result["answer"])
            datapoints["contexts"].append([str(doc.page_content) for doc in result["context"]])
            datapoints["context_urls"].append([doc.metadata["url"] for doc in result["context"]])
        dataset = Dataset.from_dict(datapoints)
        with open(dataset_file, "w") as file:
            json.dump(dataset.to_dict(), file)
        print(f"Saved {name} dataset to {dataset_file}")
    return dataset
results_folder = gold_folder / "results"
if not results_folder.exists():
    results_folder.mkdir()

def get_or_run_llm_eval(name: str, dataset: Dataset, llm_judge_model: LLM) -> pd.DataFrame:
    eval_results_file = results_folder / f"{name}_llm_eval_results.csv"
    if eval_results_file.exists():
        eval_results = pd.read_csv(eval_results_file)
        print(f"Loaded {name} evaluation results from {eval_results_file}")
    else:
        eval_results = evaluate(dataset,
                                metrics=[faithfulness, answer_relevancy, context_relevancy, answer_correctness],
                                is_async=True,
                                llm=llm_judge_model,
                                embeddings=embedding_models["gte"],
                                run_config=RunConfig(
                                    timeout=60, max_retries=10, max_wait=60, max_workers=8),
                                ).to_pandas()
        eval_results.to_csv(eval_results_file, index=False)
        print(f"Saved {name} evaluation results to {eval_results_file}")
    return eval_results
def plot_llm_eval(name: str, eval_results: pd.DataFrame):
    # select only the float64 columns (assuming these are the RAGAS metrics)
    ragas_metrics_data = (eval_results
                        .select_dtypes(include=[np.float64]))


    # boxplot of distributions
    sns.boxplot(data=ragas_metrics_data, palette="Set2")
    plt.title(f'{name}: Distribution of RAGAS Evaluation Metrics')
    plt.ylabel('Scores')
    plt.xlabel('Metrics')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

    # barplot of means
    means = ragas_metrics_data.mean()
    plt.figure(figsize=(14, 8))
    sns.barplot(x=means.index, y=means, palette="Set2")
    plt.title(f'{name}: Mean of RAGAS Evaluation Metrics')
    plt.ylabel('Mean Scores')
    plt.xlabel('Metrics')
    plt.xticks(rotation=45)

    plt.tight_layout()
    plt.show()
def plot_multiple_evals(eval_results: Dict[str, pd.DataFrame]):
    # combine the results
    full_results = []
    for name, results in eval_results.items():
        results['name'] = name
        full_results.append(results)

    full_results = pd.concat(full_results, ignore_index=True)
    full_results = full_results.sort_values(by='name')


    # select only the float64 columns (assuming these are the RAGAS metrics)
    ragas_metrics_data = full_results.select_dtypes(include=[np.float64])
    ragas_metrics_data['name'] = full_results['name']
    
    # boxplot of distributions
    plt.figure(figsize=(14, 8))
    sns.boxplot(x='variable', y='value', hue='name', data=pd.melt(ragas_metrics_data, id_vars='name'), palette="Set2")
    plt.title('Distribution of RAGAS Evaluation Metrics by Model')
    plt.ylabel('Scores')
    plt.xlabel('Metrics')
    plt.xticks(rotation=45)
    plt.legend(title='Model')
    plt.tight_layout()
    plt.show()
    
    # barplot of means
    means = ragas_metrics_data.groupby('name').mean().reset_index()
    means_melted = pd.melt(means, id_vars='name')
    
    plt.figure(figsize=(14, 8))
    sns.barplot(x='variable', y='value', hue='name', data=means_melted, palette="Set2")
    plt.title('Mean of RAGAS Evaluation Metrics by Model')
    plt.ylabel('Mean Scores')
    plt.xlabel('Metrics')
    plt.xticks(rotation=45)
    plt.legend(title='Model')
    plt.tight_layout()
    plt.show()
selected_dataset = get_or_create_eval_dataset("selected", eval_df, selected_chain)
Loaded selected dataset from data/gold/datasets/selected_dataset.json
selected_llm_eval_results = get_or_run_llm_eval("selected", selected_dataset, llm)
plot_llm_eval("selected", selected_llm_eval_results)
Loaded selected evaluation results from data/gold/results/selected_llm_eval_results.csv
datasets = {}
for name, chain in chains.items():
    datasets[name] = get_or_create_eval_dataset(name, eval_df, chain)
Loaded mini_recursive_256 dataset from data/gold/datasets/mini_recursive_256_dataset.json
Loaded mini_recursive_1024 dataset from data/gold/datasets/mini_recursive_1024_dataset.json
Loaded mini_semantic dataset from data/gold/datasets/mini_semantic_dataset.json
Loaded bge-m3_recursive_256 dataset from data/gold/datasets/bge-m3_recursive_256_dataset.json
Loaded bge-m3_recursive_1024 dataset from data/gold/datasets/bge-m3_recursive_1024_dataset.json
Loaded bge-m3_semantic dataset from data/gold/datasets/bge-m3_semantic_dataset.json
Loaded gte_recursive_256 dataset from data/gold/datasets/gte_recursive_256_dataset.json
Loaded gte_recursive_1024 dataset from data/gold/datasets/gte_recursive_1024_dataset.json
Loaded gte_semantic dataset from data/gold/datasets/gte_semantic_dataset.json
llm_results = {}
for dataset_name, dataset in datasets.items():
    llm_results[dataset_name] = get_or_run_llm_eval(dataset_name, dataset, llm)
Loaded mini_recursive_256 evaluation results from data/gold/results/mini_recursive_256_llm_eval_results.csv
Loaded mini_recursive_1024 evaluation results from data/gold/results/mini_recursive_1024_llm_eval_results.csv
Loaded mini_semantic evaluation results from data/gold/results/mini_semantic_llm_eval_results.csv
Loaded bge-m3_recursive_256 evaluation results from data/gold/results/bge-m3_recursive_256_llm_eval_results.csv
Loaded bge-m3_recursive_1024 evaluation results from data/gold/results/bge-m3_recursive_1024_llm_eval_results.csv
Loaded bge-m3_semantic evaluation results from data/gold/results/bge-m3_semantic_llm_eval_results.csv
Loaded gte_recursive_256 evaluation results from data/gold/results/gte_recursive_256_llm_eval_results.csv
Loaded gte_recursive_1024 evaluation results from data/gold/results/gte_recursive_1024_llm_eval_results.csv
Loaded gte_semantic evaluation results from data/gold/results/gte_semantic_llm_eval_results.csv
plot_multiple_evals(llm_results)

From the evaluation we can see that the RAG pipeline using the GTE embedding model by Alibaba with recursive chunking and a chunk size of 1024 performs best. This is plausible: GTE is the most powerful of the embedding models we tested, and the larger chunk size of 1024 provides the LLM with more context per retrieved passage.

best_collection = collections["gte_recursive_1024"]
best_store = collection_to_store("gte_recursive_1024", embedding_models["gte"])

Advanced MethodsΒΆ

In this final section we look at some more advanced methods to improve our RAG pipeline and compare them against our best-performing configuration.

Multi-QueryingΒΆ

Multi-querying is a technique that queries the retrieval model with several questions instead of one. Leveraging this diversity of queries lets the retriever capture a broader range of relevant information, and combining the results can improve the quality of the retrieved chunks and, consequently, of the generated responses. The goal when creating the additional queries is to make them different from the original query while still targeting the user's information need, i.e. variations of the original query.

multi-querying

def generate_query_variations(query: str, num_additional_queries: int) -> List[str]:
    multiquery_prompt = """You are an assistant tasked with generating {num_queries} \
    different versions of the given user question to retrieve relevant documents from a vector \
    database. By generating multiple perspectives on the user question and breaking it down, your goal is to help \
    the user overcome some of the limitations of the distance-based similarity search. \
    Provide these alternative questions separated by newlines without any numbering or listing.
    Original question: {question}
    Alternatives:
    """

    multiquery_chain = ChatPromptTemplate.from_template(multiquery_prompt) | llm
    return multiquery_chain.invoke({"question": query, "num_queries": num_additional_queries}).content.split("\n")
def plot_multiquery_retrieval_results(query: str, collection : Collection, num_additional_queries: int = 3, num_results: int = 3):
    vectors = get_vectors_from_collection(collection)
    umap_transform = fit_umap(vectors)
    vectors_projections = project_embeddings(vectors, umap_transform)

    query_projections = project_embeddings(collection._embedding_function([query]), umap_transform)

    query_variations = generate_query_variations(query, num_additional_queries)  # use the parameter instead of a hard-coded count
    query_variations_projections = project_embeddings(collection._embedding_function(query_variations), umap_transform)

    original_relevant_docs = collection.query(
        query_texts=[query],
        n_results=num_results,
    )
    original_relevant_docs_ids = [item for sublist in original_relevant_docs["ids"] for item in sublist] # flatten
    original_relevant_docs_embeddings = collection.get(include=["embeddings"], ids=original_relevant_docs_ids)["embeddings"]
    original_relevant_docs_projections = project_embeddings(original_relevant_docs_embeddings, umap_transform)
    
    additional_relevant_docs = collection.query(
        query_texts=query_variations,
        n_results=num_results,
    )
    additional_relevant_docs_ids = [item for sublist in additional_relevant_docs["ids"] for item in sublist] # flatten 
    # remove duplicates
    additional_relevant_docs_ids = list(set(additional_relevant_docs_ids))
    # remove the original relevant docs from the additional relevant docs
    additional_relevant_docs_ids = [doc_id for doc_id in additional_relevant_docs_ids if doc_id not in original_relevant_docs_ids]
    additional_relevant_docs_embeddings = collection.get(include=["embeddings"], ids=additional_relevant_docs_ids)["embeddings"]
    additional_relevant_docs_projections = project_embeddings(additional_relevant_docs_embeddings, umap_transform)

    fig = go.Figure()

    fig.add_trace(go.Scatter(x=vectors_projections[:, 0], y=vectors_projections[:, 1], mode='markers', marker=dict(size=5), name="other vectors"))
    fig.add_trace(go.Scatter(x=query_projections[:, 0], y=query_projections[:, 1], mode='markers', marker=dict(size=7, color='black', symbol='x'), name="original query"))
    fig.add_trace(go.Scatter(x=query_variations_projections[:, 0], y=query_variations_projections[:, 1], mode='markers', marker=dict(size=7, color='red', symbol='x'), name="query variations"))
    fig.add_trace(go.Scatter(x=original_relevant_docs_projections[:, 0], y=original_relevant_docs_projections[:, 1], mode='markers', marker=dict(size=7, color='orange'), name="original relevant docs"))
    fig.add_trace(go.Scatter(x=additional_relevant_docs_projections[:, 0], y=additional_relevant_docs_projections[:, 1], mode='markers', marker=dict(size=7, color='green'), name="additional relevant docs"))
    
    fig.show()
plot_multiquery_retrieval_results("Climate Change", selected_collection)
class MultiQueryRetriever(BaseRetriever):
    store: VectorStore
    num_additional_queries: int = 3
    num_results: int = 3

    def _get_query_variations(self, query: str) -> List[str]:
       return generate_query_variations(query, self.num_additional_queries)

    def _get_relevant_documents(
        self, original_query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        queries = self._get_query_variations(original_query)
        queries.append(original_query)
        retriever = store_to_retriever(self.store, k=self.num_results)
        relevant_docs = []
        for query in queries:
            results = retriever.invoke(query)  # invoke creates its own run manager; forwarding ours would clash
            # remove duplicates
            for res in results:
                if res not in relevant_docs:
                    relevant_docs.append(res)
        return relevant_docs
multiquery_retriever = MultiQueryRetriever(store=best_store, num_additional_queries=3, num_results=3)
multiquery_chain = create_qa_chain(multiquery_retriever)
multiquery_chain.invoke("Where are the biggest increases in wildfire smoke exposure in recent years?")
{'context': [Document(page_content='the biggest increases in wildfire smoke exposure in recent years are in the Great Plains region, from North Dakota to Texas. In addition to their impact on air quality, wildfires can disrupt processes that maintain access to drinking water, including by reducing the ability of soil to absorb water when it rains and sending additional sediment into drinking water systems. The science is clear that climate change is increasing what’ s known as the β€œ vapor pressure deficit, ” or VPD, across western North America. When VPD is high, the atmosphere can pull more water out of plants, which dries them out and makes them more likely to burn. VPD also is also a good metric for drought, including the long 21st century drought the region has been experiencing. But climate isn’ t the only factor behind the west’ s worsening wildfires. More than a century of aggressive fire suppression and an even longer period of settler colonial repression of Indigenous burning practices have led to forests that are too dense, too', metadata={'domain': 'cleantechnica', 'id': 58, 'title': 'What Does A β€œ Normal ” Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'}),
  Document(page_content='Blue River, Vida, Phoenix, and Talentβ€”were lost to the so-called Labor Day Fires in 2020. And in 2021, the Lytton Creek Fire wiped out the village of Lytton, British Columbia, destroying hundreds of homes. All told, between 2017 and 2021, nearly 120,000 fires burned across western North America, burning nearly 39 million acres of land and claiming more than 60,000 structures. The impacts of wildfires reach well beyond the people, communities, and ecosystems that are directly affected by flames. Wildfires have consequences for public health, water supplies, and economies long after a fire is extinguished. Mounting research is showing exposure to the fine particulate matter in wildfire smoke is responsible for thousands of indirect deaths, increases to the risk of pre-term birth among pregnant women, and even an increase in the risk of COVID-19 illness and death. Surprisingly, some of the biggest increases in wildfire smoke exposure in recent years are in the Great Plains region, from North Dakota to Texas. In', metadata={'domain': 'cleantechnica', 'id': 57, 'title': 'What Does A β€œ Normal ” Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'}),
  Document(page_content='parts of Washington, Oregon, Idaho, and Nevada. Some scientists are also raising concerns that all the young grasses and other plants that have sprung up as a result of the wet weather could quickly turn into dry kindling for wildfires as the dry season wears on into late summer and fall. According to the latest wildland fire outlook, most of the western United States is expected to experience either normal or below-normal fire activity between May and August this year. Source: National Interagency Fire Center. There are many different ways to measure wildfire activity, but by almost any metric, wildfires across the western US and southwestern Canada are worsening. Reliable, consistent wildfire metrics across the region started to become available in the mid-1980s. Here’ s what the trends show. From 1984 to 1999, the region experienced an average of roughly 230 fires per year. From 2000 to 2021, the average was more than 350 fires per year. The number of wildfires larger than 1,000 acres in western North', metadata={'domain': 'cleantechnica', 'id': 52, 'title': 'What Does A β€œ Normal ” Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'}),
  Document(page_content='even longer period of settler colonial repression of Indigenous burning practices have led to forests that are too dense, too uniform in their species, and without the resistance to fire they once had. Moreover, a lack of affordable housing across the region and the desire for proximity to beautiful, natural places has led to large increases in the number of people living in wildfire-prone areas. Recent research has found that human activities were responsible for starting more than 80% of all wildfires in the United States, while also increasing the length of the fire season. This year’ s wildfire season may offer the western US a chance to catch its breath after several years of record-breaking fires. But with climate change expected to deepen the hot, dry conditions that enable such record-breaking fires, we must be preparing for a future with even more fire. Carly Phillips contributed to this post. We publish a number of guest posts from experts in a large variety of fields. This is our contributor', metadata={'domain': 'cleantechnica', 'id': 59, 'title': 'What Does A β€œ Normal ” Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'})],
 'question': 'Where are the biggest increases in wildfire smoke exposure in recent years?',
 'answer': 'The biggest increases in wildfire smoke exposure in recent years are in the Great Plains region, from North Dakota to Texas.'}
datasets["multiquery"] = get_or_create_eval_dataset("multiquery", eval_df, multiquery_chain)
Loaded multiquery dataset from data/gold/datasets/multiquery_dataset.json
llm_results["multiquery"] = get_or_run_llm_eval("multiquery", datasets["multiquery"], llm)
Loaded multiquery evaluation results from data/gold/results/multiquery_llm_eval_results.csv
strategy_results = {}
strategy_results["gte_recursive_1024"] = llm_results["gte_recursive_1024"]
strategy_results["multiquery"] = llm_results["multiquery"]
plot_multiple_evals(strategy_results)

We can see that average answer correctness increases slightly with multi-querying, presumably because the retrieval step becomes more robust and captures a broader range of relevant information. Faithfulness and context_relevancy decrease, however, most likely because multi-querying retrieves more chunks overall and some of the additional chunks are less relevant, introducing noise into the context.
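One common way to reduce this noise, not used in our pipeline but worth sketching, is to fuse the per-query rankings with reciprocal rank fusion (RRF) instead of simply concatenating and deduplicating: each chunk is scored by its rank in every result list, so chunks that rank highly across several query variations rise to the top. The function and document IDs below are illustrative assumptions, not part of the notebook's code.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(results_per_query: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked result lists: each document scores sum(1 / (k + rank))."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in results_per_query:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    # sort by fused score, highest first
    return sorted(scores, key=scores.get, reverse=True)


# rankings retrieved by the original query and two variations
fused = reciprocal_rank_fusion([
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a", "doc_d"],
    ["doc_b", "doc_e", "doc_a"],
])
print(fused[0])  # "doc_b": it ranks at or near the top of every list
```

Truncating the fused ranking to the top-k chunks would then keep the context size fixed while still benefiting from the query variations.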

HyDE - Hypothetical Document EmbeddingsΒΆ

The idea of the HyDE method is to generate hypothetical documents that answer the user query and then retrieve the chunks most similar to those documents rather than to the query itself. This helps when the query is vague or phrased very differently from the chunks. Another way to think about it: by generating a hypothetical answer, we reach a region of the embedding space closer to the actual answer, a region that might not be reachable from the user query alone.

hyde

def generate_hypothetical_document(query: str, num_hypotheses: int) -> List[str]:
    hyde_prompt = """Please write a news passage about the topic.
    Topic: {query}
    Passage:
    """

    hyde_chain = ChatPromptTemplate.from_template(hyde_prompt) | llm
    hypothetical_documents = [hyde_chain.invoke({"query": query}).content for _ in range(num_hypotheses)]
    return hypothetical_documents
def plot_hyde_retrieval_results(query: str, collection : Collection, num_hypo_documents: int = 2, num_results: int = 3):
    vectors = get_vectors_from_collection(collection)
    umap_transform = fit_umap(vectors)
    vectors_projections = project_embeddings(vectors, umap_transform)

    query_projections = project_embeddings(collection._embedding_function([query]), umap_transform)

    hypothetical_documents = generate_hypothetical_document(query, num_hypo_documents)
    query_variations_projections = project_embeddings(collection._embedding_function(hypothetical_documents), umap_transform)

    original_relevant_docs = collection.query(
        query_texts=[query],
        n_results=num_results,
    )
    original_relevant_docs_ids = [item for sublist in original_relevant_docs["ids"] for item in sublist] # flatten
    original_relevant_docs_embeddings = collection.get(include=["embeddings"], ids=original_relevant_docs_ids)["embeddings"]
    original_relevant_docs_projections = project_embeddings(original_relevant_docs_embeddings, umap_transform)
    
    additional_relevant_docs = collection.query(
        query_texts=hypothetical_documents,
        n_results=num_results,
    )
    additional_relevant_docs_ids = [item for sublist in additional_relevant_docs["ids"] for item in sublist] # flatten 
    # remove duplicates
    additional_relevant_docs_ids = list(set(additional_relevant_docs_ids))
    # remove the original relevant docs from the additional relevant docs
    additional_relevant_docs_ids = [doc_id for doc_id in additional_relevant_docs_ids if doc_id not in original_relevant_docs_ids]
    additional_relevant_docs_embeddings = collection.get(include=["embeddings"], ids=additional_relevant_docs_ids)["embeddings"]
    additional_relevant_docs_projections = project_embeddings(additional_relevant_docs_embeddings, umap_transform)

    fig = go.Figure()

    fig.add_trace(go.Scatter(x=vectors_projections[:, 0], y=vectors_projections[:, 1], mode='markers', marker=dict(size=5), name="other vectors"))
    fig.add_trace(go.Scatter(x=query_projections[:, 0], y=query_projections[:, 1], mode='markers', marker=dict(size=7, color='black', symbol='x'), name="original query"))
    fig.add_trace(go.Scatter(x=query_variations_projections[:, 0], y=query_variations_projections[:, 1], mode='markers', marker=dict(size=7, color='red', symbol='x'), name="hypothetical documents"))
    fig.add_trace(go.Scatter(x=original_relevant_docs_projections[:, 0], y=original_relevant_docs_projections[:, 1], mode='markers', marker=dict(size=7, color='orange'), name="original relevant docs"))
    fig.add_trace(go.Scatter(x=additional_relevant_docs_projections[:, 0], y=additional_relevant_docs_projections[:, 1], mode='markers', marker=dict(size=7, color='green'), name="additional relevant docs"))
    
    fig.show()
plot_hyde_retrieval_results("Climate Change", selected_collection)
class HyDERetriever(BaseRetriever):
    store: VectorStore
    num_hypo_documents: int = 2
    num_results: int = 3

    def _get_hypothetical_documents(self, query: str) -> List[str]:
        return generate_hypothetical_document(query, self.num_hypo_documents)

    def _get_relevant_documents(
        self, original_query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        hypothetical_documents = self._get_hypothetical_documents(original_query)
        hypothetical_documents.append(original_query)
        retriever = store_to_retriever(self.store, k=self.num_results)
        relevant_docs = []
        for query in hypothetical_documents:
            results = retriever.invoke(query)  # invoke creates its own run manager; forwarding ours would clash
            # remove duplicates
            for res in results:
                if res not in relevant_docs:
                    relevant_docs.append(res)
        return relevant_docs
hyde_retriever = HyDERetriever(store=best_store, num_hypo_documents=2, num_results=3)  # the class has no `k` field
hyde_chain = create_qa_chain(hyde_retriever)
hyde_chain.invoke("Where are the biggest increases in wildfire smoke exposure in recent years?")
{'context': [Document(page_content='the biggest increases in wildfire smoke exposure in recent years are in the Great Plains region, from North Dakota to Texas. In addition to their impact on air quality, wildfires can disrupt processes that maintain access to drinking water, including by reducing the ability of soil to absorb water when it rains and sending additional sediment into drinking water systems. The science is clear that climate change is increasing what’ s known as the β€œ vapor pressure deficit, ” or VPD, across western North America. When VPD is high, the atmosphere can pull more water out of plants, which dries them out and makes them more likely to burn. VPD also is also a good metric for drought, including the long 21st century drought the region has been experiencing. But climate isn’ t the only factor behind the west’ s worsening wildfires. More than a century of aggressive fire suppression and an even longer period of settler colonial repression of Indigenous burning practices have led to forests that are too dense, too', metadata={'domain': 'cleantechnica', 'id': 58, 'title': 'What Does A β€œ Normal ” Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'}),
  Document(page_content='Blue River, Vida, Phoenix, and Talentβ€”were lost to the so-called Labor Day Fires in 2020. And in 2021, the Lytton Creek Fire wiped out the village of Lytton, British Columbia, destroying hundreds of homes. All told, between 2017 and 2021, nearly 120,000 fires burned across western North America, burning nearly 39 million acres of land and claiming more than 60,000 structures. The impacts of wildfires reach well beyond the people, communities, and ecosystems that are directly affected by flames. Wildfires have consequences for public health, water supplies, and economies long after a fire is extinguished. Mounting research is showing exposure to the fine particulate matter in wildfire smoke is responsible for thousands of indirect deaths, increases to the risk of pre-term birth among pregnant women, and even an increase in the risk of COVID-19 illness and death. Surprisingly, some of the biggest increases in wildfire smoke exposure in recent years are in the Great Plains region, from North Dakota to Texas. In', metadata={'domain': 'cleantechnica', 'id': 57, 'title': 'What Does A β€œ Normal ” Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'}),
  Document(page_content='Let’ s dive into western wildfires by the numbers. As spring turns to summer and the days warm up, the Northern Hemisphere enters the period known as Danger Season, when wildfires, heat waves, and hurricanes, all amplified by climate change, begin to ramp up. In the western United States, the start of Danger Season is marked by the shift from the wintertime wet season to the summertime dry season. While wildfires can and do occur all year round, this shift from cool and wet to warm and dry marks the start of wildfire season in the region. According to the latest seasonal outlook from the National Interagency Fire Center, the exceptionally rainy and snowy conditions the west experienced during the winter of 2022-2023 are translating to below-average to normal levels of wildfire risk across most western states at least through August. That said, above-normal activity is expected for parts of Washington, Oregon, Idaho, and Nevada. Some scientists are also raising concerns that all the young grasses and other', metadata={'domain': 'cleantechnica', 'id': 51, 'title': 'What Does A β€œ Normal ” Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'}),
  Document(page_content='even longer period of settler colonial repression of Indigenous burning practices have led to forests that are too dense, too uniform in their species, and without the resistance to fire they once had. Moreover, a lack of affordable housing across the region and the desire for proximity to beautiful, natural places has led to large increases in the number of people living in wildfire-prone areas. Recent research has found that human activities were responsible for starting more than 80% of all wildfires in the United States, while also increasing the length of the fire season. This year’ s wildfire season may offer the western US a chance to catch its breath after several years of record-breaking fires. But with climate change expected to deepen the hot, dry conditions that enable such record-breaking fires, we must be preparing for a future with even more fire. Carly Phillips contributed to this post. We publish a number of guest posts from experts in a large variety of fields. This is our contributor', metadata={'domain': 'cleantechnica', 'id': 59, 'title': 'What Does A β€œ Normal ” Year Of Wildfires Look Like In a Changing Climate?', 'url': 'cleantechnica.com/2023/05/15/what-does-a-normal-year-of-wildfires-look-like-in-a-changing-climate'})],
 'question': 'Where are the biggest increases in wildfire smoke exposure in recent years?',
 'answer': 'The biggest increases in wildfire smoke exposure in recent years are in the Great Plains region, from North Dakota to Texas.'}
datasets["hyde"] = get_or_create_eval_dataset("hyde", eval_df, hyde_chain)
Loaded hyde dataset from data/gold/datasets/hyde_dataset.json
llm_results["hyde"] = get_or_run_llm_eval("hyde", datasets["hyde"], llm)
Loaded hyde evaluation results from data/gold/results/hyde_llm_eval_results.csv
strategy_results["hyde"] = llm_results["hyde"]
plot_multiple_evals(strategy_results)

Just like with multi-querying, we can see that answer correctness increases when using the HyDE method.

Other MethodsΒΆ

There are many other methods that can be used to improve the RAG pipeline. Some of these include:
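One widely used example is reranking: over-retrieve candidate chunks, then re-score them against the query with a stronger model (typically a cross-encoder) and keep only the top few. The sketch below is a minimal illustration with a hypothetical `score(query, chunk)` function standing in for a real cross-encoder; none of these names come from the notebook's code.

```python
from typing import Callable, List, Tuple


def rerank(query: str, chunks: List[str],
           score: Callable[[str, str], float], top_k: int = 3) -> List[str]:
    """Re-score over-retrieved chunks against the query and keep the best top_k."""
    scored: List[Tuple[float, str]] = [(score(query, chunk), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


# toy stand-in scorer: fraction of query words that appear in the chunk
def overlap_score(query: str, chunk: str) -> float:
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)


top = rerank(
    "wildfire smoke exposure",
    ["solar panel subsidies", "wildfire smoke and health",
     "wind turbine exports", "smoke exposure trends"],
    overlap_score,
    top_k=2,
)
```

In a real pipeline the scorer would be a trained cross-encoder that reads the query and chunk jointly, which is slower than embedding similarity but considerably more precise, hence applying it only to a small candidate set.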

os.system("jupyter nbconvert --to html --template pj cleantech_rag.ipynb")
0